Receipt Data Augmentation: Using Virtual People for Real Image Analysis

August 2, 2018

Finding receipts in images and accurately analyzing their content is a hard task. Hard tasks call for deep, powerful machine learning models, and deep models require large amounts of quality data if one hopes to avoid overfitting during training while reaching high performance at inference. This is where data augmentation comes in: augmentation is used when existing data is insufficient or incomplete, or simply when the problem calls for more variance in the data.

Usually in data augmentation, a machine learning researcher will synthesize new, unseen examples from the existing data they have. The augmentation adds distortions to the data (in the case of an image: rotation, flipping, cropping, or color and lighting changes) that may appear in incoming data but are not sufficiently represented in the training data.
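As a minimal sketch of this idea (using plain NumPy rather than a full augmentation library such as imgaug, and with distortion ranges chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image: np.ndarray) -> np.ndarray:
    """Apply a random combination of simple geometric and photometric distortions."""
    out = image
    if rng.random() < 0.5:                        # random horizontal mirror flip
        out = out[:, ::-1]
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    gain = rng.uniform(0.8, 1.2)                    # random brightness change
    return np.clip(out.astype(np.float32) * gain, 0, 255).astype(np.uint8)

# Generate 10 distorted variants of one (here random) 64x64 RGB image
base = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
variants = [augment(base) for _ in range(10)]
```

Each call draws a fresh combination of distortions, so one real image yields many training samples.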

Figure: A data-augmented cat, using the imgaug package

In our case of analyzing receipt images, this means creating more fake receipt images from the ones we already have. But with receipt images, this naïve augmentation method doesn't work out so well; it is severely limited compared to applying it to natural images. Rotations push the borders of the receipt document outside the image bounds, mirror flips and non-rigid warps destroy the text, and so on.

Figure: Receipt image naïve augmentation

We therefore had to think creatively about data augmentation. Our pipeline is too big to discuss here in its entirety, but one tool we sometimes use is Blender. It's a magical open-source 3D content authoring tool that is easily extensible and scriptable with Python, a very important trait for data augmentation. Blender lets us create entirely virtual environments and film them with virtual cameras that produce realistic-looking pictures using ray-tracing rendering algorithms, much like how animated movies are made!

Figure: Fake receipt image 3D scene, and result of the rendering

Using virtual scenes for data augmentation and synthetic dataset generation is well established in computer vision. For example, the SYNTHIA dataset is an enormous, completely virtual model of a city, complete with life-like people and even pets, that is used to produce data for training autonomous-driving algorithms. The virtual world allows for the creation of orders of magnitude more samples, as well as the placement of cameras and the reenactment of situations that would be very hard to achieve in real life at such scale. It is therefore a great deal cheaper than creating a real (as opposed to synthetic) dataset.

Synthetic datasets obviously have their downsides. The data is often considered too "pristine". What makes real-world datasets so appealing is that reality is more amazing than any imagination: in a synthetic dataset, one can only capture phenomena one knows how to model and recreate. Creating synthetic scenes is also incredibly hard and time-consuming, often done only by 3D experts (modelers and animators), who charge quite a lot for their time. Still, a single synthetic scene that is adequately parameterized can generate hundreds or thousands of samples.

Parameterization of the synthetic scene is important for increasing variance. If we can numerically control many aspects of the scene, we can randomize those parameters to get 10x, 100x, or 1,000x the samples. Every added parameter multiplies the number of samples we can get. For example, with 5 lighting conditions, 5 choices of background texture, 5 hand skin tones, and 25 camera angles, we can get 5 × 5 × 5 × 25 = 3,125 (!) images from this one single scene. This becomes worthwhile very quickly, but it can't be overdone, since the basic information, the receipt texture, remains fixed and doesn't vary.
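The combinatorics above can be sketched in a few lines; the parameter names and grid sizes here are the illustrative ones from the example, not our actual scene configuration:

```python
from itertools import product

# Hypothetical parameter grids for one synthetic receipt scene
lighting    = [f"light_{i}" for i in range(5)]     # 5 lighting conditions
backgrounds = [f"texture_{i}" for i in range(5)]   # 5 background textures
skin_tones  = [f"skin_{i}" for i in range(5)]      # 5 hand skin tones
cameras     = [f"cam_{i}" for i in range(25)]      # 25 camera angles

# Every parameter multiplies the total: 5 * 5 * 5 * 25 = 3,125 combinations
combos = list(product(lighting, backgrounds, skin_tones, cameras))
print(len(combos))  # 3125
```

Adding just one more 5-valued parameter would multiply the total again, to 15,625.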

Figure: Several shots of the same receipt scene

Virtual Humans for Realistic Receipt Scans

At WAY2VAT we consume a lot of data from mobile phone scans, particularly through our proprietary smart scanner embedded in the WAY2VAT app (on Android and iOS). We therefore have a strong focus on analyzing such images: finding the receipt in the image, segmenting it out from the background, unwarping it back to a flat rectangle, and then performing OCR and text analysis. Each of these steps has an algorithm that relies on piles of data for training, validation, and testing. Very early in our work we discovered the power of synthetic data generation, and we have never stopped using it since.

Figure: Typical real image coming from a mobile phone

Images scanned with a mobile phone have a lot of variance. Every person has their own way of holding the paper up to the camera: a thumb-and-index grip from the bottom, thumb-and-index from the side, flat on the palm, and so on. Add to that an infinite variety of lighting conditions, from harsh low light to glaring bright sunlight. There is also an incredible variety of smartphone cameras taking pictures out there, for which we must create a unified scanning solution.

This led us to create a synthetic data augmentation pipeline for mobile, utilizing 3D human actors. We pose the virtual actors snapping a picture with a fake virtual phone, to simulate a real situation. We also go the extra mile and add crumpling ("high-frequency") and curl ("low-frequency") deformations to the receipt paper mesh. Blender allows this with its excellent Modifier system, which also lets us control the parameters of the deformation. Using all the parameters we mentioned before together with these new ones, we can create thousands of samples of realistic-looking receipt images captured from a mobile phone to train our algorithms.
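A sketch of how such deformations might be randomized and attached, assuming a Displace modifier for the high-frequency crumple and a SimpleDeform (bend) modifier for the low-frequency curl; the object name "Receipt", the modifier choices, and the parameter ranges are all illustrative assumptions, not our production values:

```python
import random

try:
    import bpy  # only available inside Blender's embedded Python
    IN_BLENDER = True
except ImportError:
    IN_BLENDER = False

def sample_deformation(rng: random.Random) -> dict:
    """Draw random crumple (high-frequency) and curl (low-frequency) parameters."""
    return {
        "crumple_strength": rng.uniform(0.002, 0.02),  # Displace modifier strength (illustrative range)
        "curl_angle": rng.uniform(-1.2, 1.2),          # SimpleDeform bend angle in radians (illustrative)
    }

def apply_deformation(obj, params: dict) -> None:
    """Attach the deformation modifiers to the receipt mesh (runs only inside Blender)."""
    crumple = obj.modifiers.new("crumple", type='DISPLACE')     # high-frequency crumpling
    crumple.strength = params["crumple_strength"]
    curl = obj.modifiers.new("curl", type='SIMPLE_DEFORM')      # low-frequency page curl
    curl.deform_method = 'BEND'
    curl.angle = params["curl_angle"]

rng = random.Random(7)
params = sample_deformation(rng)
if IN_BLENDER:
    apply_deformation(bpy.data.objects["Receipt"], params)  # hypothetical object name
```

Sampling the parameters outside Blender keeps the randomization testable; only the modifier attachment needs the bpy runtime.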

Figure: Example renderings from our synthetic dataset. You can see variation in: skin tone, body weight, lighting angle, lighting tone and hand angle.


Blender's Python bindings (called bpy) let us script this process, which is very time consuming: every frame may take up to one minute to render, even with a GPU. That adds up very quickly to hundreds of hours of rendering. We usually run these rendering jobs over the weekend, or during winter, to warm up the office.
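The back-of-the-envelope budget is worth scripting before committing a weekend of GPU time; the frame count and per-frame time below reuse the illustrative numbers from earlier in the post:

```python
def render_hours(num_frames: int, minutes_per_frame: float = 1.0) -> float:
    """Total wall-clock rendering time in hours."""
    return num_frames * minutes_per_frame / 60.0

# 3,125 frames at one minute each is roughly 52 hours -- a whole weekend of rendering
print(round(render_hours(3125), 1))  # 52.1
```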

This is what we have for you today, so have fun augmenting some data!

