About

pdf2Data is an add-on for iText7 that recognizes data inside PDF documents in an intuitive and predictable manner. It provides a mechanism to extract predefined data fields from PDF documents that share the same template (for example, invoices coming from the same supplier).

Intelligence

The pdf2Data tool combines several kinds of cues to find the required data, imitating the way a human reads and understands a document.
Data recognition relies on a number of rules, which need to be defined in advance for each data field. Typical rules use any details from the PDF document that help to ensure correct data extraction. These may include:
  • page range and the position on the page
  • specific font style and text patterns
  • fixed keywords next to the data
  • automatic recognition of table structures
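
To make the idea of a rule more concrete, here is a minimal sketch in plain Java of how a "fixed keyword next to the data" rule could be combined with a text pattern. It is not the pdf2Data API: the class name KeywordPatternRule, its constructor parameters and the sample invoice text are all hypothetical, and the page text is assumed to have been extracted from the PDF already.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the pdf2Data API.
final class KeywordPatternRule {
    private final Pattern pattern;

    // keyword:    fixed label expected right before the value, e.g. "Invoice No:"
    // valueRegex: text pattern that the value itself must match
    KeywordPatternRule(String keyword, String valueRegex) {
        this.pattern = Pattern.compile(Pattern.quote(keyword) + "\\s*(" + valueRegex + ")");
    }

    // Returns the first value that follows the keyword and matches the pattern.
    Optional<String> extract(String pageText) {
        Matcher m = pattern.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample text, standing in for text extracted from one PDF page.
        String pageText = "ACME Supplies\nInvoice No: INV-2041\nTotal due: 1,250.00 EUR\n";

        KeywordPatternRule invoiceNo = new KeywordPatternRule("Invoice No:", "[A-Z]+-\\d+");
        KeywordPatternRule total = new KeywordPatternRule("Total due:", "[\\d,]+\\.\\d{2}");

        System.out.println(invoiceNo.extract(pageText).orElse("<not found>"));
        System.out.println(total.extract(pageText).orElse("<not found>"));
    }
}
```

Requiring both the keyword and the pattern to match makes the rule stricter than either cue alone: a number elsewhere on the page is ignored unless it directly follows the expected label.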

How it works

Recognition follows these steps:

Step 1. Upload a sample PDF document (a template).
Step 2. Select the data in the document that you would like to extract and define the extraction rules (selectors) for each field.
Step 3. Upload any other PDF document based on the same template and check whether your data is recognized correctly.
Step 4. Start using the template in the pdf2Data server-side component. Integrate it into your document workflow as a Java library or as a command-line application.
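
The reuse of one template across documents (Steps 1 to 3) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the pdf2Data API: the class TemplateCheck, the field names and the sample texts are hypothetical, a "template" is reduced to a map from field name to a regular expression, and the document text is assumed to be extracted already.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the pdf2Data API.
final class TemplateCheck {
    // Applies every field rule to the text; fields that do not match map to null.
    static Map<String, String> apply(Map<String, Pattern> template, String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> field : template.entrySet()) {
            Matcher m = field.getValue().matcher(text);
            result.put(field.getKey(), m.find() ? m.group(1) : null);
        }
        return result;
    }

    // Rules defined once against the sample document (Steps 1 and 2).
    static Map<String, Pattern> invoiceTemplate() {
        Map<String, Pattern> template = new LinkedHashMap<>();
        template.put("invoiceNumber", Pattern.compile("Invoice No:\\s*(\\S+)"));
        template.put("dueDate", Pattern.compile("Due date:\\s*(\\d{4}-\\d{2}-\\d{2})"));
        return template;
    }

    public static void main(String[] args) {
        // Step 3: apply the same template to other documents and inspect the result.
        String matching = "Invoice No: INV-2042\nDue date: 2024-03-15\n";
        String deviating = "Invoice No: INV-2043\nDue: next month\n";

        System.out.println(apply(invoiceTemplate(), matching));
        System.out.println(apply(invoiceTemplate(), deviating));
    }
}
```

A null value for a field signals that the rule did not match the new document, which is the kind of feedback Step 3 is meant to provide before the template goes into production use.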
