Using PDFdeconstruct
Synopsis
Basic usage looks like:
pdfdeconstruct [options] PDF-file output-dir
For example:
pdfdeconstruct test.pdf testout
will create a directory called "testout", containing a "doc.xml" along with any extracted fonts and images.
The options described below can be used to modify the output.
The output format is described in detail in PDFdeconstruct Output.
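When PDFdeconstruct is driven from a script, the synopsis above maps directly onto an argument list. The following is a minimal Python sketch, assuming the pdfdeconstruct executable is on the PATH. The helper names build_argv and run_pdfdeconstruct are illustrative and are not part of PDFdeconstruct.

```python
import subprocess
from pathlib import Path


def build_argv(pdf_file, output_dir, options=()):
    """Return the argument list: pdfdeconstruct [options] PDF-file output-dir."""
    return ["pdfdeconstruct", *options, str(pdf_file), str(output_dir)]


def run_pdfdeconstruct(pdf_file, output_dir, options=()):
    """Run the tool and return the path of the generated doc.xml.

    Assumption: the 'pdfdeconstruct' executable is on the PATH.
    check=True raises CalledProcessError if the process exits with a
    non-zero status.
    """
    subprocess.run(build_argv(pdf_file, output_dir, options), check=True)
    doc_xml = Path(output_dir) / "doc.xml"
    if not doc_xml.is_file():
        raise FileNotFoundError(doc_xml)
    return doc_xml
```

For the example above, build_argv("test.pdf", "testout") yields the same command line as "pdfdeconstruct test.pdf testout".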
Options
The following command line options are available:
-output type,type,...
- Select the types of output to be included in the generated XML. The argument is a comma-separated list (e.g., "-output outline,textext,vector"), including any of the following:
  - outline: include the document outline (aka bookmarks)
  - textext: include the extracted text (column/paragraph/line/word)
  - chars: include per-character information, in addition to per-word information, in the extracted text (ignored if textext is not included)
  - form: include the formfield elements
  - text: include the text drawing operations (textop elements)
  - image: include the image drawing operations (image and imagemask elements)
  - vector: include the vector drawing operations (fill and stroke elements)
  - struct: include the structure tree
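The type list above can be checked before the command is run. The following Python sketch builds the -output argument from a list of type names. The helper name output_option is hypothetical, and rejecting "chars" without "textext" is a choice made for this sketch (the text above only says that chars is ignored in that case).

```python
# Output types exactly as listed in the text above.
OUTPUT_TYPES = (
    "outline", "textext", "chars", "form",
    "text", "image", "vector", "struct",
)


def output_option(types):
    """Return ["-output", "type,type,..."] after checking each type name."""
    types = list(types)
    if not types:
        raise ValueError("at least one output type is required")
    unknown = [t for t in types if t not in OUTPUT_TYPES]
    if unknown:
        raise ValueError("unknown output type(s): " + ", ".join(unknown))
    if "chars" in types and "textext" not in types:
        # Per the description above, chars is ignored without textext;
        # this sketch treats that as a mistake rather than passing it on.
        raise ValueError("'chars' has no effect unless 'textext' is selected")
    return ["-output", ",".join(types)]
```

With this helper, output_option(["outline", "textext", "vector"]) produces the two arguments of the example "-output outline,textext,vector".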
The -output option overrides some older options: -outline, -chars, and -textops.
-outline
- Include the document outline (aka bookmarks) in the XML output. (-output overrides this option.)
-chars
- Output per-character information, in addition to per-word information. The default (without the -chars option) is to generate per-word information only. See PDFdeconstruct Output for details. (-output overrides this option.)
-textops
- Output text drawing operations (textop elements), in addition to extracted text (column elements, etc.). See Drawing Operators for details. (-output overrides this option.)
-sepres
- Output the info and resources elements in one XML file (doc.xml), and page elements in a second XML file (pages.xml). The default is to output everything (page, info, and resources elements) in doc.xml. See PDFdeconstruct Output for details.
-seppages
- Output the info and resources elements in one XML file (doc.xml), and each page element in a separate XML file (pageNNNNNN.xml).
-unit unit:places
- Set the unit and number of decimal places for position and size output. The unit can be any of:
  - "pt" (PostScript point = 1/72 inch)
  - "inch"
  - "mil" (mil = 0.001 inch)
  - "mm"
  For example, "-unit inch:3" generates position output in the form "1.234" with a unit of inches. The default setting is "-unit pt:2".
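The relationship between the units can be worked through in a few lines. The Python sketch below converts a length in PostScript points to each unit and formats it with a fixed number of decimal places. It uses the definitions given above plus the standard 1 inch = 25.4 mm. The helper name format_length is hypothetical, and the exact rounding and trailing-zero behavior of PDFdeconstruct itself is an assumption here, not something stated in this document.

```python
# Units per inch, from the definitions above
# (1 pt = 1/72 inch, 1 mil = 0.001 inch) plus 1 inch = 25.4 mm.
UNITS_PER_INCH = {
    "pt": 72.0,
    "inch": 1.0,
    "mil": 1000.0,
    "mm": 25.4,
}


def format_length(points, unit="pt", places=2):
    """Convert a length in points to 'unit', with 'places' decimal places.

    The defaults mirror the default setting "-unit pt:2".
    """
    inches = points / 72.0
    value = inches * UNITS_PER_INCH[unit]
    return f"{value:.{places}f}"
```

For instance, a position of 88.848 pt corresponds to format_length(88.848, "inch", 3), which gives the "1.234" form shown above for "-unit inch:3".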
-imagefmt format
- Set the image file format to one of: "PNG", "TIFF", or "JPEG". All images will be converted to the specified format. The default is PNG.
-keepjpeg
- Output JPEG image streams as JPEG files. "DCTDecode" images in a PDF file are in standard JPEG format. With this flag, all DCTDecode images will be copied directly to JPEG files in the output directory without decoding and re-encoding, regardless of the -imagefmt setting. (There is one exception: CMYK DCTDecode streams are always re-encoded, because many JPEG readers don't properly handle CMYK JPEG files.)
-nofields
- Do not include form field values in the extracted text. Field values will still be included in formfield elements. This is useful if the XML consumer will be drawing form fields based on the formfield elements.
-nopatterns
- With this option, tiling patterns will be rendered as a single fill or stroke operation, with <color type="pattern"/>. Without this option, i.e., by default, tiling patterns are reduced to the tile content, repeated for each cell.
-cleantt
- Rewrite TrueType fonts to clean up certain errors. This will correct some known problems occasionally found in embedded TrueType fonts in PDF files.
-table
- Extract the text in table mode, instead of reading order mode. This will generally split the text into smaller pieces.
-discardInvisible
- Discard all "invisible" text, i.e., text drawn in the invisible rendering mode (which is most often used for OCR text).
-discardClipped
- Discard all clipped text, i.e., text which is drawn outside of the clipping region and is therefore not visible.
-f page-number
- Set the first page to convert. The -f and -l options can be used to select a range of pages smaller than the whole PDF file. The default is to convert the whole file ("-f 1 -l N").
-l page-number
- Set the last page to convert. See the description of -f above.
-pw password
- Set the password for an encrypted PDF file. This can be either the owner password or the user password. If the input PDF file is encrypted ("protected") with copying of text and graphics disabled, PDFdeconstruct will not run without the owner password.
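The page range and password options can be assembled in the same way as any other arguments. The Python sketch below builds the -f, -l, and -pw arguments and applies two basic sanity checks. The helper name page_range_options is hypothetical, and the checks assume 1-based page numbers, as suggested by the default "-f 1 -l N".

```python
def page_range_options(first=None, last=None, password=None):
    """Return the -f, -l and -pw arguments for the given settings.

    Settings left as None are omitted, so the tool's defaults apply.
    """
    if first is not None and first < 1:
        raise ValueError("first page must be 1 or greater")
    if first is not None and last is not None and last < first:
        raise ValueError("last page must not precede first page")
    args = []
    if first is not None:
        args += ["-f", str(first)]
    if last is not None:
        args += ["-l", str(last)]
    if password is not None:
        args += ["-pw", password]
    return args


# Example: convert pages 3 through 7 of test.pdf into testout.
argv = ["pdfdeconstruct", *page_range_options(first=3, last=7),
        "test.pdf", "testout"]
```

The resulting argv list can be passed to subprocess.run to run the conversion, assuming the pdfdeconstruct executable is on the PATH.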