Overview

Data extraction from PDF documents is based on the notion of selectors. Every selector is associated with a certain rectangular area, or selector region, in the PDF document (the PDF annotation rectangle) and is defined via a keyword followed by a number of additional parameters.

The aim of the selector is to extract some data from the PDF document using the location of the rectangle and the selector parameters.

Important note. The data itself may be located outside the annotation rectangle. This rectangle may be used in a number of ways:

  • to extract the text style of the corresponding data string,
  • to define location boundaries of the data, or
  • in other more complex selector algorithms.

Selectors are organized into a pipeline. During data extraction, each selector receives data from the previous selector in the pipeline, converts it to the necessary format, filters it and sends the filtered data to the next selector. The result of the selection process is the output of the last selector.
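
The chaining described above can be sketched in a few lines of Python. This is only an illustration: the function run_pipeline and the two sample selectors are hypothetical names, not part of the product, and selectors are modelled as plain functions on lists of strings.

```python
from functools import reduce
from typing import Callable, Iterable, List

# Assumption: a selector is modelled as a function from a list to a list.
Selector = Callable[[List[str]], List[str]]


def run_pipeline(selectors: Iterable[Selector], data: List[str]) -> List[str]:
    """Feed the output of each selector into the next one, in order."""
    return reduce(lambda current, selector: selector(current), selectors, data)


# Two hypothetical selectors, used only for illustration.
def keep_non_empty(lines: List[str]) -> List[str]:
    return [line for line in lines if line.strip()]


def keep_with_digits(lines: List[str]) -> List[str]:
    return [line for line in lines if any(ch.isdigit() for ch in line)]


result = run_pipeline([keep_non_empty, keep_with_digits],
                      ["Invoice 42", "", "Thank you"])
print(result)
```

The final value is whatever the last selector returns, which mirrors the rule that the result of the whole process is the result of the last selector.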

For example, before data is searched for via a regular expression, individual symbols (or glyphs) are grouped into lines or paragraphs. At the moment the selectors operate on the following data types: characters, lines, paragraphs, tables, pages and images.

Selectors

Below we list the details for all selectors and explain the syntax used to define them in the Expert mode. Most of these properties are available in the User mode in a user-friendly manner.

Fonts

Font selector

  • Keyword: font
  • Input data format: characters
  • Output data format: characters

The font selector identifies the font used for the text in the selector region and extracts all characters with the same font from the PDF document.

Note. If the text in the selector region uses several fonts, only the first one is used.

Hint. The font is considered as a combination of the font family, font size and font style. If only some of these properties should be honoured when extracting text, please use selectors fontFamily, fontSize, fontStyle.

Font family selector

  • Keyword: fontFamily
  • Input data format: characters
  • Output data format: characters

The font family selector identifies the font family used for the text in the selector region and extracts all characters with the same font family from the PDF document.

Note. If the text in the selector region uses several font families, only the first one is used.

Hint. This font selector picks up only font family, ignoring font size and font style (regular, bold, italic). If all of this should be honoured, use 'font' selector instead.

Font size selector

  • Keyword: fontSize
  • Input data format: characters
  • Output data format: characters

The font size selector identifies the font size used for the text in the selector region and extracts all characters with the same font size from the PDF document.

Note. If the text in the selector region uses several font sizes, only the first one is used.

Hint. This font selector picks up only font size, ignoring font family and font style (regular, bold, italic). If all of this should be honoured, use 'font' selector instead.

Font style selector

  • Keyword: fontStyle: normal | bold | italic | bold italic
  • Input data format: characters
  • Output data format: characters

The font style selector extracts all characters with the specified font style (normal, bold, italic, or bold italic). The font style property can be omitted. Then it is automatically detected by the style of the text in the selector region.

Note. If the text in the selector region uses several font styles, only the first one is used.

Hint. This font selector picks up only font style (regular, bold, italic), ignoring font family and font size. If all of this should be honoured, use 'font' selector instead.
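
The four font selectors differ only in which font properties they compare. A minimal Python sketch of that idea follows; the Glyph class, the COMPARED mapping and select_by_font are hypothetical names introduced for illustration, and the real matching rules (for example, size tolerance) may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Glyph:
    """Hypothetical character record with its font properties."""
    char: str
    family: str
    size: float
    style: str  # "normal", "bold", "italic" or "bold italic"


# Assumption: which properties each selector keyword compares.
COMPARED = {
    "font": ("family", "size", "style"),
    "fontFamily": ("family",),
    "fontSize": ("size",),
    "fontStyle": ("style",),
}


def select_by_font(keyword: str, region: List[Glyph],
                   document: List[Glyph]) -> List[Glyph]:
    """Keep the document glyphs that match the first glyph of the region."""
    if not region:
        return []
    reference = region[0]  # only the first font in the region is used
    attributes = COMPARED[keyword]
    return [glyph for glyph in document
            if all(getattr(glyph, name) == getattr(reference, name)
                   for name in attributes)]
```

The sketch covers only the case where the style is detected from the selector region; an explicit fontStyle: bold would compare against the given value instead.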

Position on the page

Boundary selector

  • Keyword: boundary: left | right | bottom | top
  • Input data format: characters
  • Output data format: characters

This selector restricts the area on the page where the data is located.

The selector boundary without any additional properties means that only the data inside the selector region will be extracted. However, it is often useful to honour only some of the region borders. In this case these borders can be specified as additional properties of the boundary selector.

For example, boundary:left selector means that only the characters positioned to the right of the left boundary are extracted. Similarly,

  • boundary:top means that only the characters positioned below the top boundary are extracted,
  • boundary: left right means that characters between the left and right boundaries are selected, while the vertical position is ignored.

Note. The selector boundary without arguments is equivalent to boundary: left right bottom top.
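
The border logic can be sketched as follows. This is an illustration under stated assumptions: Rect, inside and boundary are hypothetical names, each character is reduced to a single point, and the y axis grows upward as in PDF user space (so "below the top boundary" means y <= top).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Rect:
    """Hypothetical selector region; y grows upward (assumption)."""
    left: float
    bottom: float
    right: float
    top: float


ALL_SIDES = ("left", "right", "bottom", "top")


def inside(x: float, y: float, region: Rect, sides: Iterable[str] = ()) -> bool:
    """Check a point against the chosen region borders only."""
    active = tuple(sides) or ALL_SIDES  # no arguments means all four borders
    checks = {
        "left": x >= region.left,
        "right": x <= region.right,
        "bottom": y >= region.bottom,
        "top": y <= region.top,
    }
    return all(checks[side] for side in active)


def boundary(points: List[Tuple[float, float]], region: Rect,
             sides: Iterable[str] = ()) -> List[Tuple[float, float]]:
    """Keep the points that satisfy every requested border."""
    chosen = tuple(sides)
    return [p for p in points if inside(p[0], p[1], region, chosen)]
```

For instance, boundary(points, region, ["left", "right"]) keeps everything in the vertical strip between the two borders, regardless of vertical position.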

Align selector

  • Keyword: align: left | right
  • Input data format: lines
  • Output data format: lines

Selects the text based on its alignment properties. The align: left selector selects all lines that start close to the left boundary of the selector region. Similarly, the align: right selector uses the right boundary of the selector region and extracts only lines that end near this boundary.
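
A possible reading of "close to the boundary" is a fixed tolerance. The sketch below assumes this; the Line class, the align function and the default tolerance of 3 units are hypothetical and not taken from the product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Line:
    """Hypothetical text line with its horizontal extent."""
    text: str
    x_start: float
    x_end: float


def align(lines: List[Line], side: str, region_left: float,
          region_right: float, tolerance: float = 3.0) -> List[Line]:
    """Keep lines that start (left) or end (right) near the region border."""
    if side == "left":
        return [ln for ln in lines if abs(ln.x_start - region_left) <= tolerance]
    if side == "right":
        return [ln for ln in lines if abs(ln.x_end - region_right) <= tolerance]
    raise ValueError("side must be 'left' or 'right'")
```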

Text patterns

Pattern

  • Keyword: pattern: prefix="abc", suffix="xyz", pattern=date|price|iban|vat|integer
  • Input data format: lines
  • Output data format: lines

Selects the line of text with a specified prefix and / or suffix. Either prefix or suffix can be omitted.

The optional pattern keyword additionally restricts the type of contents between the prefix and the suffix. If it is not specified, any text string between given prefix and suffix will be extracted.

See Regular expression selector for more complex patterns.
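
As an illustration of the prefix and suffix logic, here is a minimal Python sketch. The function extract_between is a hypothetical name, and the content argument stands in for the pattern keyword: it takes a regular expression supplied by the caller, since the exact definitions of date, price, iban, vat and integer are not given here.

```python
import re
from typing import Optional


def extract_between(line: str, prefix: str = "", suffix: str = "",
                    content: Optional[str] = None) -> Optional[str]:
    """Return the text between prefix and suffix, or None if nothing fits."""
    start = line.find(prefix)  # an empty prefix matches at position 0
    if start < 0:
        return None
    start += len(prefix)
    end = line.find(suffix, start) if suffix else len(line)
    if end < 0:
        return None
    value = line[start:end].strip()
    # Optional restriction on the type of the contents (assumed regex form).
    if content is not None and re.fullmatch(content, value) is None:
        return None
    return value


print(extract_between("Invoice No: 12345 (paid)", prefix="No:", suffix="("))
```

Stripping the surrounding whitespace is a choice made for this sketch; the selector itself may keep it.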

Predefined fields

A number of common data fields are recognized by special selectors:

  • Date (keyword date): recognizes a date string in a number of common formats
  • Price (keyword price): recognizes a decimal number, possibly with a decimal separator, a group separator and a preceding or following currency sign
  • IBAN (keyword iban): recognizes an account number in the IBAN format
  • VAT (keyword vat): recognizes a European VAT number

Regular expressions

  • Keyword: regExp: numberOfLines=2, selectLine=2, checkLocation
  • Input data format: lines
  • Output data format: lines

The regExp selector implements the standard regular expression search with a few additional options:

  • numberOfLines parameter specifies how many regular expressions are defined (optional, default value 1).
  • selectLine parameter specifies the index of the line that will be extracted as the final output (optional, default is numberOfLines, so that the last matched group is extracted).
  • checkLocation is an optional parameter. The regExp selector with this option does the search only within the text inside left and right boundaries of the selector region.

The regular expressions (numberOfLines in total) follow the regExp keyword and define the following selection algorithm. First, all lines are sorted from top to bottom. Then the first line matching the first regular expression is found. If several regular expressions are defined, the next line matching the second regular expression is found, and so on. Finally, the line with index selectLine is extracted as the result of this process.

In addition, any of the regular expressions may contain groups defined by ( and ) brackets. If this is the case, then only the string captured by the group takes part in further data extraction.

If the option checkLocation is specified, the selection algorithm only matches the parts of the line that lie between the left and right boundaries of the selector region.

Note. If the option checkLocation is used together with grouping characters, only the string captured by the group has to lie within the left and right boundaries.
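
The selection algorithm can be sketched in Python as follows. This is an illustration, not the actual implementation: reg_exp is a hypothetical function, the lines are assumed to be already sorted from top to bottom, each expression is assumed to search strictly below the previous match, selectLine is treated as a 1-based index, and checkLocation is left out.

```python
import re
from typing import List, Optional


def reg_exp(lines: List[str], patterns: List[str],
            select_line: Optional[int] = None) -> Optional[str]:
    """Match the patterns one after another on successive lines."""
    if select_line is None:
        select_line = len(patterns)  # default: the last matched line
    matched: List[str] = []
    position = 0
    for pattern in patterns:
        regex = re.compile(pattern)
        for index in range(position, len(lines)):
            found = regex.search(lines[index])
            if found:
                # With a group, only the captured string is kept.
                matched.append(found.group(1) if regex.groups else found.group(0))
                position = index + 1  # the next pattern searches below this line
                break
        else:
            return None  # one of the expressions matched no line
    return matched[select_line - 1]


lines = ["ACME Ltd 2020", "INVOICE  NUMBER", "R 777", "Total 15"]
print(reg_exp(lines, [r"INVOICE\s+NUMBER", r"(\d+)"]))
```

Note that the digits in the first line are skipped, because the second expression is only tried on lines below the one matched by the first expression.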

Example

regExp
Invoice\s+(R\d+)

In this case there are no parameters. The selector uses only one pattern for matching and returns the string captured by the group (R\d+), that is, the letter R followed by one or more digits.

Example

regExp:numberOfLines=2
INVOICE\s+NUMBER
(\d+)

In this case selectLine is not specified and is equal to 2 by default. The first regular expression locates the line with the text INVOICE\s+NUMBER (here \s+ means one or more whitespace characters). Then the second regular expression searches for a group of digits below this line. This group of digits forms the result of the data extraction process.

Example

regExp:numberOfLines=2,checkLocation
INVOICE\s+NUMBER
(\d+)

This example is similar to the one above, but with an additional checkLocation parameter. As a result, the invoice number (the group of digits matched by the second regular expression) is searched for only within the left and right boundaries of the selector region.

Note. If option checkLocation is not specified, the selector region can still be controlled by the preceding boundary selector.

Tables

Automatic table mode

  • Keyword: table: fit
  • Input data format: lines
  • Output data format: lines or tables

Selects a single cell or a rectangular group of cells (subtable) in a table with given column headers.

If the keyword fit is specified, the table headers and the cell position (or the row and column range, if more than a single cell is selected) in the table are automatically determined by the selected region on the page.

Advanced table mode

  • Keyword: table: selectRow=1:2, selectColumn=1, numberOfColumns=2, mainColumn=2, format=simple|regexp
  • Input data format: lines
  • Output data format: lines or tables

Manually specifies the row and column numbers (or ranges using the start:end syntax), the number of columns in the table (mandatory property numberOfColumns) and the column (mainColumn) that will be used as the main reference to determine the table rows. The latter can be important if not all table rows are separated by border lines.

The keyword is followed by the names of the column headers, each on a separate line. The property format specifies how the names of the columns will be matched. By default (format=simple) the standard string comparison is used. However, one can also set format=regexp and specify regular expressions for the column names.

The selectColumn property can take not only integer values, but also the name of the required column.

In case only a single cell (n,m) needs to be extracted from the table, one can use a property selectCell=n;m, which is equivalent to specifying selectRow=n, selectColumn=m.

Example

table: selectRow=2, selectColumn=1, numberOfColumns=2, mainColumn=2
DATE
INVOICE

This example will select the contents of the cell (2,1) in the 2-column table with column headers "DATE" and "INVOICE".
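
The selectRow and selectColumn values, including the start:end range syntax, could be applied to an already detected table as in the sketch below. The names parse_range and select_cells are hypothetical, rows are assumed to be numbered from 1 with the header row excluded, and column names in selectColumn are not handled.

```python
from typing import List, Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse "2" or "1:2" into an inclusive 1-based (start, end) pair."""
    start, _, end = spec.partition(":")
    return int(start), int(end or start)


def select_cells(rows: List[List[str]], select_row: str,
                 select_column: str) -> List[List[str]]:
    """Return the requested cell or subtable from the data rows."""
    row_start, row_end = parse_range(select_row)
    col_start, col_end = parse_range(select_column)
    return [row[col_start - 1:col_end] for row in rows[row_start - 1:row_end]]


# Made-up data rows of a 2-column table with headers DATE and INVOICE.
rows = [["2020-01-05", "INV-1"],
        ["2020-02-10", "INV-2"],
        ["2020-03-15", "INV-3"]]
print(select_cells(rows, "2", "1"))
```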

Multipage tables

If a table spans more than one page, the selection algorithm will also select it (as a single table), provided that all table columns have the same width on different pages. The repeated header and footer (if any) will be removed from the final selection result, so that only the first header and the last footer are retained.

The multipage selection algorithm also detects and ignores any page headers or footers.

For better results with multipage tables, we recommend using the advanced table mode and specifying all table headers explicitly. See the above example.

Global selectors

Pages

  • Keyword: page: N:M
  • Output data format: pages

Selects pages from N to M in the document. The number -1 means the last page. To select only one page, the second number can be omitted.

Example

page: 2:-1

Selects all pages except the first one.
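
A small Python sketch of how such a page specification could be resolved, assuming 1-based page numbers; resolve_pages is a hypothetical helper, and only the value -1 is given a special meaning, as described above.

```python
def resolve_pages(spec: str, page_count: int) -> range:
    """Turn "N:M" or "N" into a range of 1-based page numbers."""
    first_text, _, last_text = spec.partition(":")

    def resolve(value: str) -> int:
        number = int(value)
        return page_count if number == -1 else number  # -1 is the last page

    first = resolve(first_text)
    last = resolve(last_text) if last_text else first
    return range(first, last + 1)


print(list(resolve_pages("2:-1", 5)))
```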

Images

  • Keyword: image: index=1:2, width=30mm:200mm, height=0mm:1000mm
  • Input data format: pages
  • Output data format: images

Selects one or several images in the document that fit into the specified width and height range. Can be preceded by the page or the boundary selector to select images only on certain pages or within a certain page region. The property index specifies the number (or a continuous range) of the image to be selected, in case more than one image fits the specified criteria.

The properties width and height use physical dimensions such as pt (default), mm, cm, in.

Example

image: index=1:-1

This will select all images in the document independently of their width and height.
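
When width and height ranges are given, values in different units have to be compared on a common scale. The sketch below converts them to points using the standard relations 1 in = 72 pt and 1 in = 25.4 mm; the helper names to_points and parse_dimension_range are hypothetical, and the accepted number format is an assumption.

```python
import re
from typing import Tuple

# Points per unit: 1 in = 72 pt, 1 in = 25.4 mm = 2.54 cm.
POINTS_PER_UNIT = {"pt": 1.0, "in": 72.0, "mm": 72.0 / 25.4, "cm": 72.0 / 2.54}


def to_points(text: str) -> float:
    """Convert a value such as "30mm" to points; no unit means pt."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(pt|mm|cm|in)?\s*", text)
    if match is None:
        raise ValueError(f"cannot parse dimension: {text!r}")
    value, unit = match.groups()
    return float(value) * POINTS_PER_UNIT[unit or "pt"]


def parse_dimension_range(spec: str) -> Tuple[float, float]:
    """Parse a "low:high" range such as "30mm:200mm" into points."""
    low, _, high = spec.partition(":")
    return to_points(low), to_points(high)


print(parse_dimension_range("30mm:200mm"))
```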

Utilities

Picker

  • Keyword: pick: type=Character|Paragraph|Line, N
  • Input data format: characters, lines, paragraphs
  • Output data format: characters, lines, paragraphs

Picks the N-th object out of the currently selected sequence of objects. Property type is optional. If it is specified, the incoming sequence is grouped into requested new object types before the N-th object (of the required type) is picked.

Example

pick: type=Paragraph, 2

Groups the previously selected sequence of characters or lines into paragraphs and then picks the second paragraph as a final result.

Paragraph

  • Keyword: paragraph: lineSpacing=Normal|Large|Huge, named="abc"
  • Input data format: characters, lines, paragraphs
  • Output data format: paragraphs

Groups incoming characters or lines into paragraphs using the specified line spacing property (lineSpacing). If it is omitted, Normal line spacing is used. Optionally the paragraph name can be specified. In this case only paragraphs with the first line matching this name will be selected.

Line

  • Keyword: line: charSpacing=Normal|Large|Huge
  • Input data format: characters, lines, paragraphs
  • Output data format: lines

Groups incoming characters into lines or splits paragraphs into lines. Property charSpacing specifies the distance between two adjacent characters that is treated as a space:

  • Normal means 1 x (space character width in the current font)
  • Large means 2 x (space character width in the current font)
  • Huge means 4 x (space character width in the current font)

If charSpacing property is omitted, Normal character spacing is used.
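
The effect of charSpacing can be illustrated with a short Python sketch. The Glyph class and join_glyphs are hypothetical names, the glyphs are assumed to lie on one text row already, and a gap is treated as a space when it is at least as wide as the threshold (the exact comparison used by the selector is not stated here).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Glyph:
    """Hypothetical character with its horizontal extent."""
    char: str
    x_start: float
    x_end: float


# Multipliers of the space character width, as listed above.
SPACING_FACTOR = {"Normal": 1, "Large": 2, "Huge": 4}


def join_glyphs(glyphs: List[Glyph], space_width: float,
                char_spacing: str = "Normal") -> str:
    """Join glyphs of one row, inserting a space at each wide enough gap."""
    threshold = SPACING_FACTOR[char_spacing] * space_width
    ordered = sorted(glyphs, key=lambda g: g.x_start)
    pieces: List[str] = []
    previous = None
    for current in ordered:
        if previous is not None and current.x_start - previous.x_end >= threshold:
            pieces.append(" ")
        pieces.append(current.char)
        previous = current
    return "".join(pieces)


row = [Glyph("N", 0.0, 5.0), Glyph("o", 5.5, 10.0), Glyph("7", 14.0, 19.0)]
print(join_glyphs(row, space_width=3.0))
```

With Large spacing the same gap of 4 units is below the threshold of 6, so the three glyphs are joined without a space.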

Extracting none or multiple data records per field

By default the extraction algorithm expects to find a single data record for each field in the template. However, the user can indicate that:

  • the presence of the field is optional (Check box Zero results allowed). If this option is enabled, missing data for a given field would not be considered as an error.
  • there might be more than a single record for a given field (Check box Multiple results allowed). If this option is enabled, returning more than one record for a field would be considered as a normal situation.

If these options are not enabled and the extraction algorithm finds none or more than one record for this field, it will report a warning.
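
The rule can be summarized by the following Python sketch. The function check_record_count and the warning texts are hypothetical; only the logic of the two check boxes is taken from the description above.

```python
from typing import List, Optional


def check_record_count(records: List[str], zero_allowed: bool = False,
                       multiple_allowed: bool = False) -> Optional[str]:
    """Return a warning text, or None if the record count is acceptable."""
    if not records and not zero_allowed:
        return "no record found for this field"
    if len(records) > 1 and not multiple_allowed:
        return "more than one record found for this field"
    return None


print(check_record_count([]))
print(check_record_count(["a", "b"], multiple_allowed=True))
```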