Reading for Analphabetic Computers: Automatic Receipt Image Understanding
When we look at a receipt we immediately understand what it is about. Surprisingly, we do that with very little support from the receipt document itself. Receipts have very little text, possibly not even a single valid sentence, but they are very highly structured. So we humans read receipts without actually “reading” them: we don’t parse the text into sentences and words and construct meaning. Rather, we rely on our big brains to decode the document and extract the useful information.
At Way2VAT we analyze receipts so that we can understand them automatically. We found that asking a computer to read a receipt is very similar to asking a person who cannot read (a certain language) to do the same. When words have no meaning, we must rely on other cues and prior information.
Let’s play a game. Before you is a receipt image taken recently in Japan. Please extract the following information from it: (1) grand total amount paid, (2) VAT paid, (3) date of purchase, (4) what service was given? (taxi, hotel, restaurant, parking, etc.). Not coincidentally, these are some of the bits of information we would need if we wanted to ask the tax authorities of Japan to reimburse us for the VAT we paid. This is not a test; the answers are below. To make things fair, we will reveal that at the current Yen-USD exchange rate, 1,000 Yen is about $9 USD.
Taking a wild guess here: you got most things right, and you did it really fast too. Well done! I didn’t know you could read Japanese. For argument’s sake, let’s say you cannot read Japanese. Still, extracting information from this receipt was not challenging, even without understanding a single word or knowing whether something is a word at all. How did you achieve such a feat in such a short amount of time?
Geometric cues for reading receipts
Beyond just reading the numbers (fortunately the numeral system in Japan is Western Arabic), what contextual information did you use? Did you examine the variation in typeface (weight, size)? The location of each number on the page? Its adjacency to other elements such as other numbers, text or symbols? Did you recall other receipts in languages that you can read? How about the numbers: did they follow a pattern you used to make sense of them? Did you notice how knowing the Yen-USD exchange rate is imperative for understanding what kind of expense this was?
The example above throws plenty of curve balls at anyone trying to understand the receipt. First, the largest amount on the document (2,022 Yen) is not the grand total; it is in fact the cash received. Nor is the grand total the amount lowest on the page, as one might expect. It is not the only emphasized amount either: the cash returned (405 Yen) is also emphasized. This mix of cues may be confusing, but it does not prevent understanding. We probably have a very smart mechanism to avoid such obstacles: selective reading.
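One of the cues at work here is plain arithmetic: the numbers on a receipt constrain each other. The sketch below is only an illustration of that idea, not a description of our production system. It takes the four amounts from the example and tries every assignment of roles, keeping those where cash received minus grand total equals cash returned and where the VAT is a plausible share of the net price. The function name `find_roles` and the 30% ceiling on the VAT rate are assumptions made for this example.

```python
from itertools import permutations

# Amounts read off the example receipt (in Yen), in no particular order.
amounts = [2022, 1617, 405, 119]

# Assumption for illustration: VAT is at most 30% of the net price.
MAX_VAT_RATE = 0.30


def find_roles(amounts, max_vat_rate=MAX_VAT_RATE):
    """Return every (cash, total, change, vat) assignment that is consistent."""
    matches = []
    for cash, total, change, vat in permutations(amounts, 4):
        if cash - total != change:
            continue  # cash received minus total must equal cash returned
        net = total - vat
        if net <= 0 or vat / net > max_vat_rate:
            continue  # VAT must be a plausible share of the net price
        matches.append(
            {"cash": cash, "total": total, "change": change, "vat": vat}
        )
    return matches


roles = find_roles(amounts)
print(roles)
```

Note that the subtraction check alone is ambiguous, since 2,022 - 405 = 1,617 holds just as well as 2,022 - 1,617 = 405. It is the VAT check that settles it: 119 Yen is about 7.9% of a 1,498 Yen net price, but more than 40% of 286 Yen. At roughly $9 per 1,000 Yen, the 1,617 Yen total comes to around $14.5, which also fits a meal far better than a hotel night.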
We believe receipt reading is a selective process, sometimes known as “reading with a purpose”. The reader is actively looking for specific bits of information instead of methodically scanning the document top-to-bottom. Kenneth Goodman proposed this in his seminal “Reading: A Psycholinguistic Guessing Game” (1967). His original approach was heavily criticized for being unscientific (he based his study on a single data point!). Still, a whole movement of understanding the psychology of reading was established from this early work.
At Way2VAT we teach computers to read without being able to “really read”. Instead of going for meaning, we decode documents on a geometric level. We extend Goodman’s ideas by establishing a “Geometric Selective Reading”. Receipt documents are teeming with geometries. Here is a little sketch of some of these geometries:
We can clearly see the tabular nature of the receipt document, and some distinct horizontal separators: some are actual dashed lines or asterisks, while others are created by whitespace. Indentation helps us see which lines are the line items and distinguish them from the totals. The horizontal compartments help separate semantically different sections top-down: vendor information, line items, total and VAT, grand total, and finally cash transaction. All these geometric cues have nothing to do with the text. It is simply an ink-blot, negative-space analysis, which is very common practice in typographic and graphic design, stemming from Gestalt psychology’s figure-ground principle.
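To make the whitespace cue concrete, here is a minimal sketch of one classic way to find horizontal compartments: count the ink in every pixel row and cut the page wherever enough consecutive rows are empty. This is a textbook projection profile shown for illustration, not our actual pipeline. The function names, the `min_gap` threshold and the toy image are all assumptions made for the example, and it handles only whitespace separators, not dashed lines or asterisks.

```python
def row_ink_profile(binary_image):
    """Count ink pixels (value 1) in each row of a binarized image."""
    return [sum(row) for row in binary_image]


def find_horizontal_bands(binary_image, min_gap=2):
    """Split the page into bands separated by at least `min_gap` blank rows.

    Returns a list of (first_row, last_row) tuples, both inclusive.
    """
    profile = row_ink_profile(binary_image)
    bands = []
    start = None     # first row of the band being built
    last_ink = None  # most recent row that contained ink
    for y, ink in enumerate(profile):
        if ink == 0:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank run is wide enough to count as a separator.
            bands.append((start, last_ink))
            start = y
        last_ink = y
    if start is not None:
        bands.append((start, last_ink))
    return bands


# Toy 10x6 "receipt": 1 is ink, 0 is paper.
toy = [
    [1, 1, 1, 1, 0, 0],  # rows 0-1: vendor block
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],  # rows 4-6: line items
    [0, 0, 0, 0, 0, 0],  # a single blank row is ordinary line spacing
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],  # row 9: total
]

bands = find_horizontal_bands(toy)
print(bands)
```

On the toy image this yields three bands: the vendor block, the line items (whose single blank row is too narrow to split them), and the total. Nothing in it looks at what the ink says, only at where it is.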
The Automatic Invoice Analyzer – A Computational Visual “Brain”
Way2VAT’s Automatic Invoice Analyzer (AIA) is a recreation of this human perceptual selective reading process in the computer. We use deep learning technology to build a perceptual pipeline that analyzes the receipt top-down and bottom-up. By top-down and bottom-up we mean the direction of hierarchical perception, rather than spatially in the image domain. Take a look at the following manipulations of the same receipt image (after binarization and perspective corrections):
These are actually standard exercises in visual analysis, often done by graphic designers and typographers. To see the complete picture, you must distance yourself from the details.
Looking top-down, we can segment the image into sections: text-heavy vs. number-heavy, left vs. right, top vs. bottom, congested vs. sparse. In practice, we train a convolutional neural network (CNN) that takes as input both the image and the results of running OCR on it, in a proprietary type of hOCR representation. This learner outputs a multi-class semantic segmentation map: a breakdown of the image into semantically meaningful parts. The parts are then broken down further, which is what makes the approach top-down.
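Looking bottom-up, we start from the smallest elements, the characters and words, and work toward larger structures. This is done in practice with convolutional and recurrent neural language models, on the character as well as the word level.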
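How can a segmentation network consume OCR results alongside pixels? Our own representation is proprietary, so the following is just one simple possibility, sketched under our own assumptions for illustration: paint each OCR word box onto a per-pixel feature channel (here, the fraction of digits in the word) and stack it with the ink channel. The tuple layout of `words`, both function names and the choice of feature are hypothetical.

```python
def rasterize_ocr(words, height, width):
    """Paint each OCR word box with the fraction of its characters that are digits.

    `words` is a list of (text, x0, y0, x1, y1) tuples with exclusive x1 and y1
    (a hypothetical layout). Pixels outside every box stay at 0.0.
    """
    channel = [[0.0] * width for _ in range(height)]
    for text, x0, y0, x1, y1 in words:
        if not text:
            continue
        digit_fraction = sum(ch.isdigit() for ch in text) / len(text)
        # Clip the box to the image bounds before painting.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                channel[y][x] = digit_fraction
    return channel


def stack_channels(ink, ocr_channel):
    """Combine per-pixel ink and OCR features into an H x W x 2 nested list."""
    return [
        [[ink_px, ocr_px] for ink_px, ocr_px in zip(ink_row, ocr_row)]
        for ink_row, ocr_row in zip(ink, ocr_channel)
    ]


# Toy 2x6 image with one text word and one amount on the first row.
ink = [
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
words = [("合計", 0, 0, 2, 1), ("1,617", 3, 0, 6, 1)]

ocr_channel = rasterize_ocr(words, height=2, width=6)
stacked = stack_channels(ink, ocr_channel)
print(ocr_channel[0])
```

The resulting channel tells a network "this region is number-heavy" without telling it what any word means, which is in the spirit of the text-heavy vs. number-heavy distinction above.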
Finally, we combine the two low-level analysis approaches in a high-level model that is our “selective reader”. The reader reads with a purpose: to find the bits of information useful for Way2VAT’s VAT reclaim processor. This is the essence of Way2VAT’s leading AIA technology. Learning from the psychology of visual design and perception and the underpinnings of psycholinguistics, we built a computational receipt analyzer capable of reading without understanding the language. Just as a teaser: we do “understand” Japanese, so we can actually do much better at understanding the receipt, but you will have to wait for the details on that front.
 Answers: (1) 1,617 Yen, (2) 119 Yen, (3) April 3rd 2018, (4) Restaurant meal.
“A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure-ground organization”, Wagemans et al., Psychological Bulletin, 2012.