The Biggest Little Problems in Fully Automatic Invoice Image Analysis
At WAY2VAT we continuously improve our automatic pipeline for invoice image analysis. Over the years we have found that some problems are far harder than others, and there are plenty of them. The hardest ones combine several sub-problems that must all work correctly, and often they are not the “big problems” such as data extraction and document analysis. Standard “Invoice Analysis” solutions do not address these problems today, and probably will not soon. Here is a short list of three of our favorites.
Multiple Invoices – One Image, Multiple Images – One Invoice
If you’re building an automatic pipeline for invoice image analysis, you must be prepared to receive images containing multiple invoices, as well as multiple images spanning a single invoice. We call such data One I – Many I, or OI-MI (pronounced “Oy My”), after George Takei’s “Oh My.” In our datasets we estimate that up to 15% of an average incoming pipeline is such data. Current invoice analysis services expect a single invoice per image and a single image per invoice, which means they cannot be applied to OI-MIs. Handling OI-MIs means running a segmentation algorithm to detect invoice instances within an image, or an algorithm that splits a sequence of pages into individual invoices. That in turn requires a dataset and training, introduces false positives, and comes with no guarantees on overall performance. A big little problem…
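To make the "multiple images, one invoice" half of the problem concrete, here is a minimal sketch in Python. It is not our production method: it assumes each image has already been run through OCR, and it relies on a hypothetical "Page X of Y" marker that many real invoices do not carry. The function name and the marker pattern are illustrative assumptions only.

```python
import re
from typing import List, Optional

# Assumed page marker such as "Page 2 of 3"; real invoices vary widely.
PAGE_MARKER = re.compile(r"page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def group_pages_into_invoices(page_texts: List[str]) -> List[List[int]]:
    """Group consecutive page indices into invoices using page markers.

    A page starts a new invoice unless its marker shows the page number
    we expect next for the invoice currently being collected.
    """
    invoices: List[List[int]] = []
    expected_next: Optional[int] = None  # next page number if the invoice continues

    for index, text in enumerate(page_texts):
        match = PAGE_MARKER.search(text)
        page_no = int(match.group(1)) if match else None

        if invoices and page_no is not None and page_no == expected_next:
            invoices[-1].append(index)   # continuation of the open invoice
        else:
            invoices.append([index])     # unmarked or unexpected page: new invoice

        # Only expect a continuation when the marker says more pages follow.
        if match and page_no < int(match.group(2)):
            expected_next = page_no + 1
        else:
            expected_next = None

    return invoices
```

Even this toy version shows why the problem is "little" only in name: any page without a marker is treated as a new invoice, so every OCR miss turns into a segmentation error downstream.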
Handwritten for Failure
Unfortunately, handwriting OCR is not yet as mature as its printed-text sibling. Handwriting OCR is offered by the major OCR-as-a-Service providers such as Google; however, the printed-text OCR services are more robust and more accurate. Detecting the text is also just the beginning of the problem with handwritten invoices or receipts. Handwritten invoices are likely to have little or no structure, and amounts and other key details may be poorly aligned or missing altogether. We observe that 5-10% of the invoices in our standard data pipeline are handwritten.
The Long-Long-Tail of Vendors
Any organization interested in insights into its invoice data is likely to look for vendor detection. The unfortunate part is that there are literally millions of vendors (that issue invoices) in any given country. Some vendors recur often, such as big hotel or restaurant chains, major mailing and shipping services, transportation services like Uber, or big service companies. But the long tail of vendors outside that list makes up a big part of the data pipeline. Identifying an arbitrary vendor requires having a database of vendors on hand and running a query against it.
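As a rough illustration of what "running a query" against a vendor database can involve, the sketch below fuzzy-matches a vendor name read by OCR against a small in-memory table using Python's standard difflib module. The vendor table, the IDs, the suffix list, and the 0.8 similarity cutoff are all assumptions made for the example; a real system would need a proper index and far more careful normalization.

```python
import difflib
import re
from typing import Dict, Optional

# Legal-form suffixes dropped before comparing; an illustrative, incomplete list.
LEGAL_SUFFIXES = {"inc", "ltd", "llc", "gmbh", "sa", "bv", "plc", "co"}


def normalize_vendor_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal-form suffixes."""
    cleaned = name.lower().replace(".", "")        # "B.V." -> "bv"
    tokens = re.sub(r"[^a-z0-9 ]", " ", cleaned).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def find_vendor(ocr_name: str, vendors: Dict[str, str],
                cutoff: float = 0.8) -> Optional[str]:
    """Return the ID of the best-matching vendor, or None below the cutoff."""
    index = {normalize_vendor_name(name): vid for name, vid in vendors.items()}
    matches = difflib.get_close_matches(
        normalize_vendor_name(ocr_name), list(index), n=1, cutoff=cutoff
    )
    return index[matches[0]] if matches else None


# Hypothetical vendor table: display name -> internal ID.
VENDORS = {"Uber B.V.": "V001", "Hilton Hotels Ltd": "V002"}
```

With this table, `find_vendor("HILT0N HOTELS LTD.", VENDORS)` tolerates the OCR confusion between "0" and "O", while a long-tail name such as "Joe's Diner" returns `None`, which is exactly the case the database has to cover.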
A database of all vendors worldwide may exist only in the hands of government organizations, Google or IBM, and they do not offer access to it. The European Union has the VIES system for looking up vendors in the EU based on their VAT/GST numbers. Google provides some APIs from its Maps product, such as the Places API, but they come at a cost and do not scale easily because they support a single query at a time (large-volume queries go through the sales team). The OpenStreetMap project has established some APIs for finding businesses, but the data may need cleaning and may be missing many key details. Some third-party companies sell datasets, but these can cost hundreds of thousands of dollars! The problem of the vendor dataset is a hard one…
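When the only available source answers one paid query at a time, one common mitigation is to remember every answer, including the misses, so that repeat vendors never trigger a second remote call. The sketch below shows that idea in Python. The `lookup_fn` callable stands in for whatever remote service is used; it is a hypothetical placeholder and not the actual interface of VIES, the Places API, or OpenStreetMap.

```python
from typing import Callable, Dict, Optional


class CachedVendorLookup:
    """Memoize a one-query-at-a-time vendor lookup, including misses."""

    def __init__(self, lookup_fn: Callable[[str], Optional[dict]]) -> None:
        self._lookup_fn = lookup_fn                  # hypothetical remote call
        self._cache: Dict[str, Optional[dict]] = {}
        self.remote_calls = 0                        # how many paid queries were made

    def lookup(self, vendor_key: str) -> Optional[dict]:
        # Normalize case and whitespace so trivial variants share one cache entry.
        key = " ".join(vendor_key.upper().split())
        if key not in self._cache:
            self.remote_calls += 1
            self._cache[key] = self._lookup_fn(key)
        return self._cache[key]
```

A cache helps with the vendors that recur often, but it does nothing for the long tail: every vendor seen for the first time still costs a query.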
Summary
If you are looking at rolling your own invoice analysis system, like the one we have at WAY2VAT, you will likely run into at least one of the problems above. Scratching the surface reveals that some problems are harder than they look, and that there are no ready-made solutions. Some problems require ingenuity, creativity, and time for research and implementation.