Visual Linguistic Methods for Receipt Field Tagging
Last week we presented our paper on a new method for receipt field tagging at the Asian Conference on Computer Vision. The method was developed at WAY2VAT over the last few years, and we felt ready to share it with the general computer vision community, to spur a discussion around our ideas and join the academic discourse in this domain. The presentation was a smashing success, with many interested parties stopping by our poster, and we were also invited to give an oral presentation at the Combining Vision & Language workshop. In this post we would like to share with you an informal view of the paper and our proposed methods.
Why is Receipt and Invoice Understanding so Hard?
The motivation for our research stems from the difficulty naïve methods have in succeeding at receipt and invoice document understanding. The problem has been studied for decades but is still considered open, attracting many researchers and state-of-the-art document analysis methods. We believe the general problem is hard because of the incredible variance in the input. Receipt images often arrive for computational analysis in very bad condition: poor lighting, tearing, crumpling, dirt, noise or printing imperfections, bad focus, missing parts, and many more problems. These problems compound with the way receipt documents are naturally printed: sparse, usually in a dot-matrix font, heavy with punctuation and numbers, tabular, and so on. All of this makes reading receipt documents automatically an incredibly hard job for any OCR and language analysis approach.
Our work at WAY2VAT aims to undo all that complication using machine learning tools. The paper we presented at ACCV'18 offers one of our approaches to doing this.
The focus of the paper is our proposal for a new way to get embeddings (vectors) for words on a receipt or invoice document. The core idea is to combine both spatial and linguistic features in the embedding. This creates a far stronger feature space for higher-level processing, as we demonstrate in the paper on two tasks: invoice field tagging and OCR error correction. The reasoning is that textual elements on the invoice document are related on both the spatial and the semantic level. For example, the word “Total:” will likely appear to the left of the total sum, e.g. “$10.99”. Therefore, if we can jointly learn the intrinsic properties of a word as well as its extrinsic relationships, we will be learning the semantics of invoice words and clearing a path for invoice understanding.
The basic idea behind the Skip-Rect Embedding, as we call it, is to take lessons from Char2Vec and Word2Vec and inject geometric information into them. In Char2Vec the goal is to be robust to spelling errors, which are very common in text obtained from OCR, and to avoid having an “Unknown” token, which is prevalent in word-oriented methods for text analysis. In our work we implement Char2Vec as a series of pooled convolutions of different apertures over the character list, after it has been transformed with a vanilla fully connected embedding.
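As a rough sketch of this idea (not the paper's exact architecture: the character set, apertures, and dimensions below are illustrative, and the filters are random stand-ins for learned weights), a convolutional character embedding might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789$.:-/% ")}
CHAR_DIM, OUT_PER_APERTURE = 16, 8
APERTURES = (2, 3, 4)

# Fully connected character embedding table (+1 row for unknown characters).
char_table = rng.normal(size=(len(VOCAB) + 1, CHAR_DIM))
# Hypothetical fixed filters; in a trained model these would be learned.
filters = {k: rng.normal(size=(k * CHAR_DIM, OUT_PER_APERTURE)) for k in APERTURES}

def char2vec(word):
    """Embed each character, convolve with several aperture widths, and
    max-pool over positions to get one fixed-size vector per word."""
    ids = [VOCAB.get(c, len(VOCAB)) for c in word.lower()]
    x = char_table[ids]                              # (len(word), CHAR_DIM)
    if x.shape[0] < max(APERTURES):                  # pad short words
        x = np.vstack([x, np.zeros((max(APERTURES) - x.shape[0], CHAR_DIM))])
    feats = []
    for k, w in filters.items():
        windows = [x[i:i + k].ravel() @ w for i in range(x.shape[0] - k + 1)]
        feats.append(np.max(windows, axis=0))        # max-pool per aperture
    return np.concatenate(feats)                     # fixed size for any word
```

Because every word, misspelled or not, is built from known characters, no word ever maps to an “Unknown” token.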
We use Char2Vec almost exclusively for robustness to spelling errors. Word2Vec is perfectly capable of learning the different formats, since none of them are one-off occurrences. What we want is the ability to work without “unknown” tokens.
Figure 1: Char2Vec convolutional embedding.
Using another embedding layer, Word2Vec, is advantageous for invoice understanding, since many invoice fields follow particular formats: sums (e.g. $NN.NN), dates (e.g. NN-NN-NNNN), IDs, addresses, phone numbers (e.g. +NNN-N-NNN-NNNN) and general words. This embedding learns the common formats, which carry the most weight.
This allows us to learn word-intrinsic relationships; for encoding the geometric relationships, however, we take a different path. First, we take a lesson from Word2Vec works such as skip-gram (the namesake of our method), but instead of looking at neighboring words in a sentence we pose geometrically-neighboring words as our unsupervised pairwise dataset. The words are encoded with the Char2Vec framework, trained separately. We train our model in an unsupervised fashion, which lets us use a great amount of data and, as a bonus, exposes the network to many OCR errors. In total we used more than 7 million Skip-Rect word pairs to train our models.
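A minimal sketch of how such geometric pairs could be mined from OCR output (the tuple layout and neighbor count here are our own illustrative choices, not the paper's exact procedure):

```python
import math

def skip_rect_pairs(words, k=2):
    """For each detected word, pair it with its k nearest neighbors by
    bounding-box centre distance -- the geometric analogue of skip-gram's
    context window. `words` is a list of (text, (x, y, w, h)) tuples."""
    centres = [(x + w / 2, y + h / 2) for _, (x, y, w, h) in words]
    pairs = []
    for i, (text, _) in enumerate(words):
        dists = sorted(
            (math.dist(centres[i], centres[j]), j)
            for j in range(len(words)) if j != i)
        for _, j in dists[:k]:
            pairs.append((text, words[j][0]))
    return pairs
```

Applied to a whole corpus of receipts, a procedure like this yields the millions of unsupervised training pairs mentioned above.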
Figure 2: Skip-Rect pairs creation.
OCR Error Correction with Skip-Rect Embedding
One application of this new kind of embedding is fixing errors in the OCR. Since we purposefully include all the OCR errors in the learning set, we can try to recover errors using semantic proximity. The intuition is that the meaning (the Skip-Rect Embedding vector) of the word “T0ta1”, which probably appeared where the correct word would be, will be close to the meaning of the correct word, “Total”, without OCR errors. Therefore, if we can regress outlier words to their cluster's meaning, i.e. to a more “average” or “heavy” word in the high-dimensional embedding space, we are likely to find the correct word that the OCR damaged.
We tested this hypothesis and found it to hold. Using a dataset of carefully hand-annotated receipts, we created a dictionary of corrections in the embedding space, i.e. a mapping from embedding vectors to word characters. An incoming, potentially OCR-damaged, embedded word then finds its nearest neighbor in the dictionary and takes it as its correction.
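A toy sketch of this lookup (the bag-of-characters embedding below is a deliberately crude stand-in for the real Skip-Rect Embedding, just to make the nearest-neighbor mechanics concrete):

```python
import numpy as np

def toy_embed(word):
    """Crude stand-in for the Skip-Rect Embedding: a bag-of-characters
    count vector, enough to put 'T0ta1' near 'Total'."""
    v = np.zeros(128)
    for c in word.lower():
        v[ord(c) % 128] += 1.0
    return v

def build_correction_dict(clean_words, embed=toy_embed):
    """The 'dictionary': verified clean words keyed by their embeddings."""
    return {w: embed(w) for w in clean_words}

def correct(word, dictionary, embed=toy_embed):
    """Snap a possibly OCR-damaged word to its nearest clean neighbor
    in embedding space, by cosine similarity."""
    v = embed(word)
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(dictionary, key=lambda w: cos(v, dictionary[w]))
```

With the real learned embedding, the neighborhood structure comes from spatial and semantic context rather than raw character overlap, but the lookup itself works the same way.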
Figure 3: Example OCR result words (green) and their corrected versions (blue). The corrections are: (a) ‘autegril?’ → ‘autogrill’, (b) ‘hancheste’ → ‘manchester’ and (c) ‘congoyrge’ → ‘concourse’
Deep Field Tagging
The goal of our work is to extract information from receipts. With the ability to assign semantic information to all OCR-recovered words in the document, we should now be able to use this information to “read” the receipt and find the locations of important fields. We do this by utilizing convolutional hourglass networks (e.g. UNet) for pixel-level multiclass classification. In other words, for every pixel in the document image we try to learn which field it belongs to based on the geometric-semantic information, reinforced by its surroundings through the convolutions. The convolutions, being translation invariant, also help generalize spatially across the localities of the page, meaning we are intrinsically able to find the “Total: $10.99” whether it is situated in the left part of the image or the right.
Figure 4: The embedding image. Every pixel gets the Skip-Rect Embedding vector.
To use convolutional nets, we first create an “embedding image”, where each pixel gets the embedding vector of the word beneath it. We then feed it through a UNet with skip connections to get an image of the same size with K logits per pixel (probabilities of belonging to any of the K classes). This then goes through a novel “reverse softmax” that we propose, which basically flips the regular softmax problem on its head: instead of looking for the argmax class for every pixel (in essence, rectangle), we look for the “argmax” rectangle for every class. The mathematical formulation and derivatives are in the paper.
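The embedding-image step can be sketched like so (the dimensions and the placeholder embedding function are illustrative; pixels with no word underneath simply get a zero vector):

```python
import numpy as np

def embedding_image(shape, words, embed, embed_dim):
    """Paint each word's embedding vector onto every pixel its bounding
    box covers; background pixels stay zero. `words` is a list of
    (text, (x, y, w, h)) tuples."""
    height, width = shape
    img = np.zeros((height, width, embed_dim), dtype=np.float32)
    for text, (x, y, w, h) in words:
        img[y:y + h, x:x + w] = embed(text)  # broadcast the vector over the box
    return img
```

The resulting H x W x D tensor is what the UNet consumes in place of an ordinary RGB image.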
The result is not unlike a segmentation of the image, where areas likely to belong to a certain class are picked up by the network. A final post-processing step gets us the fields' tagging.
Figure 5: Field tagging using the embedding image.
We were able to show an improvement over a state-of-the-art method from ICDAR'17 called CloudScan by Palm, Winther and Laws (2017), as well as superiority over naïve or ablated variants of our method.
Figure 6: Comparative results of our method under different conditions. Method (c) is Palm et al. 2017.
The gory details of our new tagging method can be seen in the paper. In the meantime, we are continuously working on improving our results by incorporating more data and better methods. Look out for more innovative works from WAY2VAT’s machine learning team!
WAY2VAT Machine Learning Team