Text
Structure
All text on the page is written first, broken down into columns, paragraphs, lines, and words. If there is any text on the page, the first children of the page element will be one or more column elements.
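As a rough illustration of this layout, the sketch below parses a small hand-written sample with Python's standard xml.etree.ElementTree module. The element names page, column, and word come from the description above. The sample document itself is made up, and the paragraph and line levels are left out because their tag names are not given in this section.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not real PDFdeconstruct output. Real output nests
# paragraph and line levels between column and word; they are omitted here
# because their tag names are not specified in this section.
SAMPLE = """
<page>
  <column><word/><word/></column>
  <column><word/></column>
</page>
"""

page = ET.fromstring(SAMPLE)

# Text columns are direct children of the page element.
columns = page.findall("column")

# iter() searches descendants at any depth, so it still finds the words
# when intermediate paragraph and line elements are present.
words_per_column = [sum(1 for _ in col.iter("word")) for col in columns]
print(len(columns), words_per_column)
```

Using iter() instead of a fixed path keeps the sketch independent of the exact nesting between column and word.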
Words
word elements have a number of attributes and child elements:
The bounding box (llx, lly, urx, ury) uses the units specified with the -unit option.
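A minimal sketch of reading these four attributes, assuming (as the names suggest) that llx/lly give the lower-left corner and urx/ury the upper-right corner. The word element and its coordinate values are invented for illustration, and the helper name bbox is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical word element; the coordinate values are invented.
word = ET.fromstring('<word llx="72.0" lly="700.0" urx="108.5" ury="712.0"/>')

def bbox(elem):
    """Return (llx, lly, urx, ury) as floats."""
    return tuple(float(elem.get(k)) for k in ("llx", "lly", "urx", "ury"))

llx, lly, urx, ury = bbox(word)
width = urx - llx    # horizontal extent, in the units chosen with -unit
height = ury - lly   # vertical extent, in the units chosen with -unit
print(width, height)
```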
The rotation angle is 0, 90, 180, or 270, indicating the text direction, in degrees counterclockwise from horizontal.
The character position is the index of the first character in the word, in content stream order. Character positions are used in highlight files generated by PDF search tools.
The font tag refers to a font in the resources element. The font size is in the units specified with the -unit option.
The underlined value will be either "true" or "false".
If the pos attribute is present, the position value will be one of:
- "sub": the word is a subscript
- "super": the word is a superscript
- "dropcap": the word is a drop capital (at the start of a paragraph)
For all other words, the pos attribute will not be present.
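Because the pos attribute is optional, code that reads it needs a default for the case where it is absent. The sketch below shows one way to handle that. The sample wrapper element and the label "normal" are illustrative choices, not part of the PDFdeconstruct output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the pos attribute and its three values are
# taken from the description above.
SAMPLE = """
<sample>
  <word/>
  <word pos="super"/>
  <word pos="sub"/>
  <word pos="dropcap"/>
</sample>
"""

def position(word):
    """Return the pos value, or "normal" when the attribute is absent."""
    return word.get("pos", "normal")

positions = [position(w) for w in ET.fromstring(SAMPLE).iter("word")]
print(positions)
```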
The space-after value, which is either "true" or "false", indicates whether there is a space after this
word, before the next word on the same line. This is usually true,
but it will be false in certain cases: on the last word in a line,
between a word and a following subscript or superscript, between words
where there is a change in font (PDFdeconstruct creates separate word
elements for this), etc.
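One use of the space-after value is rebuilding the text of a line from its words. The following is a sketch under stated assumptions: the attribute is taken to be spelled space-after, each word is taken to hold its text in a text child element, and the sample data and the helper name join_words are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The attribute name "space-after" is an assumption;
# the text child element follows the description of word elements.
SAMPLE = """
<sample>
  <word space-after="true"><text>E</text></word>
  <word space-after="true"><text>=</text></word>
  <word space-after="false"><text>mc</text></word>
  <word space-after="false" pos="super"><text>2</text></word>
</sample>
"""

def join_words(words):
    """Concatenate word texts, adding a space only where space-after is true."""
    parts = []
    for w in words:
        parts.append(w.findtext("text", default=""))
        if w.get("space-after") == "true":
            parts.append(" ")
    return "".join(parts)

line_text = join_words(ET.fromstring(SAMPLE).iter("word"))
print(line_text)
```

In the sample, no space is inserted between "mc" and the superscript "2", matching the subscript and superscript case described above.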
If this word is (part of) a URL-type link, the link
attribute value will be set to the link target.
All colors are converted to RGB. The RGB components use the PDF/PostScript convention of real numbers between 0 and 1. For example, 50% gray ("#808080" in HTML terminology) has all three components equal to 0.5.
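To relate the two conventions, the sketch below converts real-valued components in the range 0 to 1 into an HTML-style hexadecimal string. The function name rgb_to_hex is hypothetical, and rounding to the nearest 8-bit value is a choice made here, not something the PDFdeconstruct output specifies.

```python
def rgb_to_hex(r, g, b):
    """Convert RGB components in [0, 1] to an HTML "#rrggbb" string."""
    def channel(x):
        x = min(max(x, 0.0), 1.0)   # clamp out-of-range input
        return int(x * 255 + 0.5)   # round to the nearest 8-bit value
    return "#{:02x}{:02x}{:02x}".format(channel(r), channel(g), channel(b))

gray = rgb_to_hex(0.5, 0.5, 0.5)
print(gray)
```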
The text element contains the text of the word (converted to Unicode).
Characters
If the -chars option is used, word elements will have additional char children. There will be one char element for each character in the word.
char elements provide the individual bounding boxes for each character in the word.
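A small sketch of working with per-character boxes, assuming char elements carry the same llx, lly, urx, ury attributes as word elements (this section does not list the char attributes explicitly). The sample data is invented. The code collects the character boxes and computes the smallest rectangle that encloses them all.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; coordinates are invented, and the attribute names
# on char are assumed to match those on word.
SAMPLE = """
<word llx="10" lly="20" urx="22" ury="30">
  <char llx="10" lly="20" urx="14" ury="30"/>
  <char llx="14" lly="22" urx="18" ury="28"/>
  <char llx="18" lly="20" urx="22" ury="29"/>
</word>
"""

word = ET.fromstring(SAMPLE)
boxes = [tuple(float(c.get(k)) for k in ("llx", "lly", "urx", "ury"))
         for c in word.findall("char")]

# Smallest rectangle enclosing every character box.
union = (min(b[0] for b in boxes), min(b[1] for b in boxes),
         max(b[2] for b in boxes), max(b[3] for b in boxes))
print(len(boxes), union)
```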