According to an Allied Market Research report, the worldwide data extraction market was valued at $2.14 billion in 2019 and is expected to reach $4.90 billion by 2027.

Nowadays, the problem of data extraction and document understanding is critical for many businesses, including the banking, financial services, and insurance segments. Manual document processing is costly for a variety of reasons.

Human Cost of Document Tracking and Errors

1. Maintaining the correct version of a document can be difficult, especially when it is revised multiple times. If document tracking is not done correctly, it can lead to double payments, delivery of extra items, etc.

2. A frequent supplier and buyer exchange many similar documents and transactions, which makes mix-ups and duplicates easy to miss.

3. The process cannot scale. Maintaining an optimal number of human resources is hard when the processing volume changes rapidly. Most companies have these departments overstaffed to compensate for spikes in volume.

Payment or Procurement Delays

4. Data from the documents is entered into the systems manually. This process becomes a bottleneck when the volume of documents processed increases.

5. Workflow delays can lead to delivery, payment, or procurement delays. As a result, companies face a high cost of working capital or loss of revenue due to delays in procuring raw materials, etc.

Inventory Errors

6. If inventory systems are not correctly integrated with document processing, miscalculated inventory can be costly, leading to overstocking, duplicate orders, understocking, and loss of revenue.

Automatic OCR (optical character recognition) is a set of computer vision tasks that convert scanned documents and images into machine-readable text. An OCR system takes images of documents, invoices, and receipts, finds the text in them, and converts it into a format that machines can process more easily. If you want to read the information on ID cards or the numbers on a bank cheque, OCR is what will drive your software.

In our case, OCR functionality was needed to extract structured information from invoices, receipts, and other types of customer documents. To solve this task, we developed an AI (Artificial Intelligence) solution based on LayoutLMv3. To satisfy the model's input requirements, our approach relies on recognizing text lines together with the bounding boxes of the words inside each line.
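As a rough illustration of that input format, here is a minimal sketch using the Hugging Face LayoutLMv3Processor with its built-in OCR disabled, so that words and boxes come from our own OCR step; the file name, words, and box coordinates are made-up examples:

```python
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False: we supply words and boxes produced by our own OCR step.
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = Image.open("invoice_page.png").convert("RGB")      # hypothetical sample page
words = ["Invoice", "No.", "12345"]                        # words grouped by text line
boxes = [[110, 80, 215, 108], [225, 80, 270, 108], [280, 80, 370, 108]]  # 0-1000 normalized

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding["input_ids"].shape, encoding["bbox"].shape)
```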

The dataset used for the benchmark consists of around 200 documents in English of the above-mentioned types. They were annotated manually by our team.

Our benchmark research focuses on the following four OCR tools.

Tesseract OCR

Tesseract is an open-source text recognition engine available under the Apache 2.0 license. It can be used directly or through an API to extract printed text from images, and it supports a wide variety of languages. Tesseract does not have a built-in GUI, but several are available from its 3rdParty page, and it is compatible with many programming languages and frameworks through third-party wrappers. It can be used with its existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.
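From Python, a minimal sketch with the pytesseract wrapper looks like this (the sample file name is hypothetical, and the Tesseract binary must be installed separately):

```python
import pytesseract
from PIL import Image

image = Image.open("invoice_page.png")  # hypothetical sample page

# Plain text of the whole page.
text = pytesseract.image_to_string(image, lang="eng")

# Word-level boxes and confidences from the TSV output.
data = pytesseract.image_to_data(image, lang="eng", output_type=pytesseract.Output.DICT)
for word, conf, x, y, w, h in zip(
    data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
):
    if word.strip():
        print(word, conf, (x, y, w, h))
```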

Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and structured data such as forms and tables from scanned documents. It detects printed text and handwriting in the standard English alphabet and ASCII symbols, and extracts printed text, forms, and tables in English, German, French, Spanish, Italian, and Portuguese.
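A minimal sketch of synchronous text detection with boto3 (the region and file name are placeholders; asynchronous APIs exist for large, multi-page documents):

```python
import boto3

client = boto3.client("textract", region_name="us-east-1")  # assumes configured AWS credentials

with open("invoice_page.png", "rb") as f:
    response = client.detect_document_text(Document={"Bytes": f.read()})

# LINE and WORD blocks carry the recognized text and normalized bounding boxes.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"], block["Geometry"]["BoundingBox"])
```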

Azure Computer Vision

Azure Computer Vision is an AI service that analyzes content in images and video. Its OCR functionality extracts printed and handwritten text from images and documents with mixed languages and writing styles.
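A minimal sketch of the asynchronous Read API via the Python SDK; the endpoint, key, and file name are placeholders:

```python
import time
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials

client = ComputerVisionClient(
    "https://<resource>.cognitiveservices.azure.com/", CognitiveServicesCredentials("<key>")
)

with open("invoice_page.png", "rb") as f:
    read_response = client.read_in_stream(f, raw=True)

# The Read API is asynchronous: poll the operation until it finishes.
operation_id = read_response.headers["Operation-Location"].split("/")[-1]
while True:
    result = client.get_read_result(operation_id)
    if result.status not in (OperationStatusCodes.running, OperationStatusCodes.not_started):
        break
    time.sleep(1)

if result.status == OperationStatusCodes.succeeded:
    for page in result.analyze_result.read_results:
        for line in page.lines:
            print(line.text, line.bounding_box)
```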

Google Document AI

Document AI is a document understanding solution that takes unstructured data (e.g., emails, invoices, forms, and other documents) and makes it easier to understand, analyze, and consume. It also provides ML-based OCR functionality for these kinds of documents.
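A minimal sketch of processing a document with the Python client; the project, location, processor ID, and file name are placeholders:

```python
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path("my-project", "us", "my-processor-id")

with open("invoice_page.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

# The response carries the full OCR text plus per-page layout (blocks, paragraphs, lines, tokens).
document = result.document
print(document.text[:500])
for page in document.pages:
    print("page", page.page_number, "lines:", len(page.lines))
```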

The benchmark was performed using the following metrics. First, we calculate the average percentage of lines that fully match the text of the manual annotation:
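Written out (assuming the per-document ratio is averaged over the whole dataset):

```latex
\mathrm{FullMatch} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{N_i}{M_i}
```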

where N_i is the number of correctly recognized lines in document i, M_i is the total number of lines in that document, and n is the dataset size. Second, we calculate the same metric for lines without punctuation (which may be treated differently by different OCR tools), and again for lines whose normalized Levenshtein distance to the annotation does not exceed a threshold of 0.7.
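A minimal sketch of these three line-level metrics for a single document, assuming recognized lines are already aligned with the annotated ones (function and variable names are ours):

```python
import string

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    return levenshtein(a, b) / max(len(a), len(b), 1)

def strip_punct(s: str) -> str:
    return s.translate(str.maketrans("", "", string.punctuation)).strip()

def line_match_rates(predicted: list[str], annotated: list[str], threshold: float = 0.7):
    """Share of annotated lines matched exactly, without punctuation, and within
    the normalized Levenshtein threshold; the averages over documents follow."""
    m = len(annotated)
    exact = sum(p == a for p, a in zip(predicted, annotated)) / m
    no_punct = sum(strip_punct(p) == strip_punct(a) for p, a in zip(predicted, annotated)) / m
    fuzzy = sum(normalized_levenshtein(p, a) <= threshold for p, a in zip(predicted, annotated)) / m
    return exact, no_punct, fuzzy
```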

The next considered metric was the average intersection over union (IoU) among word bounding boxes. IoU is calculated by dividing the overlap between the predicted and ground truth annotation by the union of these, then the average is taken:
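For a predicted word box and its ground-truth counterpart:

```latex
\mathrm{IoU}(B_{\mathrm{pred}}, B_{\mathrm{gt}}) =
  \frac{\mathrm{area}(B_{\mathrm{pred}} \cap B_{\mathrm{gt}})}
       {\mathrm{area}(B_{\mathrm{pred}} \cup B_{\mathrm{gt}})}
```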

The results we obtained are summarized in the following table:

Although AWS Textract and Azure Computer Vision showed comparable results for the English language, we chose the Azure Computer Vision OCR functionality: it supports more languages, which is critical for our multilingual solutions, and relies on state-of-the-art models.