Going Beyond Pixels – Part 1: Invoice Images as Graphs
We usually think of invoice images as, well, images – a matrix of pixels, light-intensity values on a regular grid. This preserves the highest fidelity of information, since everything on the invoice is encoded in those pixel values. On the other hand, pixels, which can easily number in the millions for a standard smartphone image, are a whole lot of data for the little information the receipt actually holds, e.g. the amounts and dates. Therefore, invoice image analysis systems are about hierarchical abstraction: moving information from a low-level representation to increasingly more complex structures. In the past we modeled this abstraction visually with convolutional neural networks: https://way2vat.com/visual-linguistic-methods-for-receipt-field-tagging/.
A different way of abstracting visual information is using graphs, a node-edge connective structure that encodes visual information by assigning nodes to spatial locations and connecting the nodes with edges that encode the visual relationships between them. Graph structures have traditionally been used in computer vision and image analysis mostly for segmentation – splitting the image into smaller parts that share the same semantic meaning. So how do we use this abstraction method to analyze invoices?
In our upcoming technical paper in the ACM Symposium on Document Engineering we propose a new method for invoice analysis with graphs. The first step is thinking about the invoice as a graph, abstracting the visual structure into a node-edge representation. The question is: what parts of the invoice image should get a node? And what is the relationship between invoice nodes? We choose to represent the image using the OCR process, where each word rectangle is a node. This is a well-established approach in document analysis research, appearing as early as the INFORMys work from Cesarini et al., and up to very recent work by Qian et al.
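As a concrete illustration, a node in such a graph could be as simple as the recognized word string plus its OCR bounding rectangle. The class and field names below are our own illustrative sketch, not the paper's exact data model:

```python
from dataclasses import dataclass

@dataclass
class WordNode:
    """One node of the invoice graph: a single OCR-detected word box."""
    text: str   # the recognized word, e.g. "Total:"
    x: float    # left edge of the word rectangle (pixels)
    y: float    # top edge of the word rectangle (pixels)
    w: float    # rectangle width (pixels)
    h: float    # rectangle height (pixels)

    @property
    def center(self):
        # Center point of the box; handy for deciding spatial relations.
        return (self.x + self.w / 2, self.y + self.h / 2)
```

Any OCR engine that reports word-level bounding boxes (Tesseract's TSV output, for instance) can be mapped directly onto such nodes.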
Naturally, a single word in an invoice (for example “Total” or “938a”) does not carry a lot of information on its own. The true power of a hierarchical representation comes from drawing edges between the nodes to connect them, allowing algorithms to reason about groups of nearby nodes. The next question is: how do we connect the nodes of the graph?
We take a simple approach here, connecting the nodes based on their pixel locations in the image. We consider the cardinal directions: North, South, East and West. A node connects to its right (Eastern) neighbor if the neighbor is the first node located along the positive x axis of the image. Similarly, we connect the left (Western) nodes, the upper (Northern) ones and those below (Southern). In the figure above we can see an example image encoded as a graph. We call this a Cardinal Graph representation of the invoice image. The cardinal graph is a convenient modality, since it is intuitive both to construct and to understand – it simply makes visual sense. There are obvious downsides to it as well:
Currently we encode only the four cardinal directions, i.e. no diagonals.
It fails to encode connections that go beyond the first immediate neighbor; it cannot “skip” nodes, which are often errors or noise arising from the OCR.
Some cardinal connections carry very little or no information at all, for example between words that are very far apart in the image and have no reason to be connected.
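For concreteness, the cardinal-edge construction described above can be sketched in a few lines of Python. The row/column overlap test and the tuple-based node format below are simplifying assumptions of ours, not necessarily the paper's exact rule:

```python
def cardinal_edges(nodes):
    """Connect each word box to its nearest neighbor in each cardinal
    direction. `nodes` is a list of dicts with keys text, x, y, w, h
    (top-left corner plus size, in pixels). Returns a set of
    (i, j, direction) edges over node indices."""
    def center(n):
        return (n["x"] + n["w"] / 2, n["y"] + n["h"] / 2)

    edges = set()
    for i, a in enumerate(nodes):
        ax, ay = center(a)
        nearest = {}  # direction -> (distance, neighbor index)
        for j, b in enumerate(nodes):
            if i == j:
                continue
            bx, by = center(b)
            dx, dy = bx - ax, by - ay
            if abs(dy) <= a["h"] / 2 and dx != 0:
                # Roughly the same text row: an East/West neighbor.
                d, dist = ("E" if dx > 0 else "W"), abs(dx)
            elif abs(dx) <= a["w"] / 2 and dy != 0:
                # Roughly the same column: a North/South neighbor.
                d, dist = ("S" if dy > 0 else "N"), abs(dy)
            else:
                continue
            # Keep only the first (nearest) node in each direction.
            if d not in nearest or dist < nearest[d][0]:
                nearest[d] = (dist, j)
        for d, (_, j) in nearest.items():
            edges.add((i, j, d))
    return edges
```

On a toy layout with “Total:” at the top left, “9.99” to its East and “Date” below it, this yields exactly the edges a reader would draw by eye: Total→9.99 (E), 9.99→Total (W), Total→Date (S), Date→Total (N).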
Other problems exist, but for the most part this representation is simple and powerful! The shortcomings of the cardinal edge structure are readily addressed by graph analysis algorithms, such as graph cuts and belief propagation, that see past immediate node neighborhoods to find higher-level insights. Additionally, cardinal-direction connections carry a strong informational cue, since they capture the relations in most invoices’ tabular layout, where the word “Total:” lies West of the numeric total sum.
Constructing the graph is just the first step; the next is using this representation for an analysis application such as information extraction. In the next part of the series we will dive deeper into graph algorithms for information extraction. Stay tuned!