Put It In Reverse: Flipping the Softmax Function on its Head

January 11, 2019

In receipt and invoice image understanding there is one prevalent problem: Information Extraction, i.e. extracting pieces of information from the scanned receipt or invoice image. What counts as "interesting information" is up to the application developer, but some very common fields are: Total Amount, Date, Vendor, and others. At WAY2VAT we are interested in everything the receipt can provide: the aforementioned fields and also VAT\GST paid, line items, type of expense, the vendor's VAT\GST number, and more. This type of problem implies there is a location in the receipt that holds a piece of information, although we know that in most cases there are multiple such locations. The Total Amount, for example, will very often appear in more than one place; in the following receipt it appears multiple times.

This multiplicity warrants a selection process: which of the candidate Total Amounts is the right one? One thing we must keep in mind when working with OCRed receipt images is that OCR errors often corrupt the true printed words and sums. It is therefore not enough to simply "catch at least one of the total amounts and be fine"; we must be able to find the "best" one, the one that is least corrupted and most likely to be the true Total Amount. If we argue that only one appearance is the "best", then there must be a way to score them all and select the best according to that score.

The Softmax Function

Back to the machine learning world. When making decisions in multi-class classification problems (such as ours), one common approach is to employ the Softmax function, which is a generalization of the Sigmoid activation function used for binary (two-way) classification. The softmax function is defined as follows:

Formula 2:

$$\sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{i=1}^{K} e^{z_i}}, \qquad k = 1, \dots, K$$

K is the number of classes. The z_k's are the scores we mentioned before. Each field in the receipt image gets a score that essentially says, "how much is this field like a: Total Amount, VAT\GST Paid, Date, …". Together these scores make up the vector z (of length K).

In our case we have multiple classes, e.g. Total Amount, Date, VAT\GST Paid, etc., and we assume the classes are mutually exclusive at the field level; in other words, one field in the receipt cannot fulfill more than one class. For example, one sum in the receipt cannot simultaneously be the Total Amount and the VAT\GST Paid, that doesn't make sense at all! So, if we have a smart machine that magically gives us the z_k's (the scores), the softmax function turns that vector into a probability distribution (it sums to 1). And that is mostly it: beyond this point we simply pick the highest-scoring element, an "argmax" operation.
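As a minimal sketch of this standard pipeline (the scores here are made-up numbers, not output of any real model), turning a score vector into a distribution and picking the argmax looks like:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: exponentiate scores, normalize to sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical class scores (z_k) for one receipt field,
# ordered as [Total Amount, VAT\GST Paid, Date]:
z = np.array([3.1, 0.4, -1.2])
p = softmax(z)

print(p)             # a proper probability distribution
print(p.sum())       # ~1.0
print(np.argmax(p))  # index 0, i.e. classified as Total Amount
```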

However, recall our problem: we have multiple instances that will likely receive the Total Amount classification with high probability. In many cases this will be true to the document; in fact we expect many such classifications in most receipts. But we still do not know how to choose between them. Surely, we could use the softmax-transformed probability scores to find the one instance with the highest probability, but that is not a fair comparison between them. If we do that, we are not ranking all the Total Amount occurrences one against the other; we are comparing each of them individually against the other categories.

Racing Horses

To illustrate why this is a potentially problematic situation, here’s an analogy from the horse racing world. Imagine you have 5 horses, and you run them all in 100 races in different places in the world. The horses are our categories: Total, VAT\GST, Date, etc., and the races are the image locations. For each image location we race our 5 horses to see who wins. This is standard softmax.

Now, instead of asking the usual horse-racing question, "Which horse is the best in race X?", our question is: "Which race is the best for horse A?" (analogous to finding the best image location for field A), and equivalently for every horse. We could simply look at all the races in which horse A came in first ("argmax"), and then pick, among those, the race horse A won by the biggest margin. But this is not a fair comparison, because different races have different features. One race track is flat, another is hilly, yet another is grass while the rest are dirt. How can we compare the objective performance of horse A under such uneven race conditions? Each horse copes differently with each of these conditions.

Instead, we can directly try to find the race in which horse A is best. So, we don’t run all the horses in all the races, we “run the races in all the horses”. The races now compete against each other, and not the horses. We rank all the races with respect to every horse and pick the best race for the horse. This is the essence of the reverse softmax selection.
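In matrix terms, the whole metaphor boils down to which axis the softmax normalizes over. A small sketch (the score matrix below is invented for illustration): rows are image locations ("races"), columns are field classes ("horses"); standard softmax normalizes each row, reverse softmax normalizes each column.

```python
import numpy as np

def softmax(z, axis):
    """Softmax along a chosen axis, with the usual max-subtraction for stability."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical scores: rows = image locations, columns = [Total, VAT\GST, Date].
scores = np.array([[4.0, 1.0, 0.2],
                   [3.5, 2.8, 0.1],
                   [0.3, 0.2, 5.0]])

# Standard softmax: at each location the classes compete (rows sum to 1).
per_location = softmax(scores, axis=1)

# Reverse softmax: for each class the locations compete (columns sum to 1).
per_class = softmax(scores, axis=0)

# Best location for the Total column (index 0):
best_total_location = int(np.argmax(per_class[:, 0]))
```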

Reverse Softmax

How do we express this concept mathematically and computationally? Putting aside the horse racing metaphor, let's define a scoring function S that accounts for all the features each word xj from receipt x can have:

Formula 3:

$$S(x_j; \theta) = \theta^{\top} x_j$$

Just a simple linear function, where the parameters θ are essentially weights for each word-feature. Next, staying within the softmax framework, we wish to convert these scores to probabilities like so:

Formula 4:

$$P(y = j \mid x; \theta) = \frac{e^{S(x_j; \theta)}}{\sum_{i=1}^{N} e^{S(x_i; \theta)}}$$

Here, the probability that word j is the Total Amount, given sample invoice x, is its softmax score over all N words in the sample. So the words are competing against each other, not the labels. Hence we expect the word that is the true Total Amount to receive the maximal probability, from a softmax standpoint. From here, we define a cost function for minimization:
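As a concrete sketch of the linear scorer feeding the word-level softmax (the feature matrix and weights below are made-up placeholders, not the actual features or learned parameters from the paper):

```python
import numpy as np

# Hypothetical setup: 5 candidate words from one receipt, 4 features per word.
X = np.array([[0.9, 0.1, 0.0, 0.3],
              [0.2, 0.8, 0.1, 0.0],
              [0.7, 0.2, 0.9, 0.1],
              [0.1, 0.1, 0.2, 0.9],
              [0.4, 0.5, 0.3, 0.2]])
theta = np.array([1.0, -0.5, 2.0, 0.3])  # one weight per word-feature

scores = X @ theta                 # S(x_j; theta) = theta^T x_j for every word
e = np.exp(scores - scores.max())  # subtract max for numerical stability
p = e / e.sum()                    # the words compete: one distribution over 5 words
best = int(np.argmax(p))           # the word most likely to be the Total Amount
```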

Formula 5:

$$J(\theta) = -\sum_{j=1}^{N} \mathbb{I}\left[x_j = \text{TOT.AMT.}\right] \log P(y = j \mid x; \theta)$$

Where I is the indicator function: if word j is indeed the Total Amount (TOT.AMT.), the function returns 1. When the indicator is 1, we (negatively) add the log-probability to the loss; in other words, if the true word receives a low probability, it adds a large amount to the loss. Conversely, we encourage the model to assign high probability to correct classifications.
Now, it all hinges on the θ parameters, because they ultimately control the probabilities. We look for θ parameters that make as few misclassifications as possible, and thus we have an optimization problem on our hands. For good measure we add a regularization term on the parameters, to control overfitting. This formulation is easily differentiable w.r.t. θ:

Formula 6:

$$\nabla_\theta \left( J(\theta) + \lambda \lVert\theta\rVert^2 \right) = \sum_{j=1}^{N} \left( P(y = j \mid x; \theta) - \mathbb{I}\left[x_j = \text{TOT.AMT.}\right] \right) x_j + 2\lambda\theta$$

We can use a standard stochastic gradient descent (SGD) solver to find an optimal solution on a large batched dataset. Finally, using the probabilities we can select the best word for each label by way of an argmax.
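The loss, its gradient, and a gradient step can be sketched in a few lines. This is a toy single-receipt version, assuming hand-made features and one true Total Amount word per receipt; it is an illustration of the objective, not the production training code.

```python
import numpy as np

def train_step(X, true_idx, theta, lr=0.1, lam=1e-3):
    """One SGD step on a single receipt.
    X: (N, D) features of the N candidate words; true_idx: the true Total Amount word."""
    scores = X @ theta
    e = np.exp(scores - scores.max())
    p = e / e.sum()                                   # reverse softmax over words
    loss = -np.log(p[true_idx]) + lam * theta @ theta # NLL + L2 regularization
    onehot = np.eye(len(p))[true_idx]                 # the indicator function
    grad = X.T @ (p - onehot) + 2 * lam * theta       # gradient of Formula 6's form
    return theta - lr * grad, loss

# Toy receipt: 4 candidate words with hand-made 3-d features; word 2 is the true Total.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
theta = np.zeros(3)
for _ in range(100):
    theta, loss = train_step(X, true_idx=2, theta=theta)
# After training, an argmax over the word probabilities recovers word 2.
```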


To recap, the receipt field-extraction problem is plagued by multiple selection: since we have multiple word-instances of the same information type (label), with no guarantee that any particular one of them is the right one, we cannot make a singular label-classification decision and must consider all word candidates. To that end, we proposed a reverse classification scheme, where we optimize for selecting the best word for each label, reversing the standard softmax-based decision.
Find out more details in our ACCV’18 paper: “Visual Linguistic Methods for Receipt Field Tagging”.

WAY2VAT Machine Learning Team.
