Text Extraction
Extracting text from PDF files raises a lot of questions. This document explains how PDF text extraction works and tries to answer those questions.
Not a markup language
First, PDF is not a markup format like HTML. That is, PDF files do not contain plain text, with some extra decoration denoting different fonts, boldface, etc.
Rather, PDF is a page description format, much closer to PostScript than to HTML. Each page in a PDF file is defined by one or more content streams containing a series of commands. These commands change the current color, draw filled polygons, change the current font, draw text, and so on. Text can be (and usually is) broken into small chunks for purposes of kerning. To display "Text can be...", a PDF file might draw "T", then move back a little to the left, then draw "ext can be...". A PDF text extractor must reassemble these chunks into the proper sequence of characters.
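As a rough illustration of that reassembly step, here is a minimal Python sketch. It assumes the text was drawn with the PDF TJ operator, whose operand is an array of strings interleaved with numeric adjustments (in thousandths of a text-space unit, where a positive number moves left and a negative number moves right). The function name join_tj_array and the 200-unit gap threshold are hypothetical choices for this example, not something any particular extractor is known to use.

```python
from typing import List, Union

# One element of a TJ array: either a text chunk or a numeric position adjustment.
TJItem = Union[str, int, float]


def join_tj_array(items: List[TJItem], space_threshold: float = 200.0) -> str:
    """Concatenate the string chunks of a TJ array into one run of text.

    Small adjustments are treated as kerning and ignored. A large negative
    adjustment (a big move to the right) is treated as a word gap and
    becomes a single space. The threshold is an assumed heuristic.
    """
    pieces = []
    for item in items:
        if isinstance(item, str):
            pieces.append(item)
        elif item < -space_threshold:
            pieces.append(" ")
    return "".join(pieces)


# "T", move back a little to the left, then "ext can be..."
kerned = join_tj_array(["T", 70, "ext can be..."])

# Two chunks separated by a large rightward move, read here as a word gap.
spaced = join_tj_array(["Hello", -300, "world"])
```

A real extractor also has to track the text matrix, font size, and glyph widths, and must handle chunks that arrive out of reading order; this sketch covers only the simplest case.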
Fonts, encodings, and subsets
The next problem concerns fonts, or more specifically encodings. A PDF file can use any number of fonts. A font is a collection of glyphs - Times-Roman, Helvetica, and Courier each have their own glyph for the letter 'A', for example. Each font also has an encoding, which is a mapping from character codes (numbers) to glyph names. Sometimes, this looks like plain old ASCII: code 46 maps to 'period', code 48 maps to 'zero', code 65 maps to 'A', etc. A PDF text extractor converts the character codes to glyph names, using the font encoding, and then converts the glyph names to whatever output encoding was requested (ASCII, UTF-8, etc.).
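The two-step lookup described above can be sketched in a few lines of Python. The tables here contain only the three entries mentioned in the text; the names ENCODING, GLYPH_TO_UNICODE, and decode_codes are made up for this example, and a real extractor would use the font's full encoding and a complete glyph name list.

```python
from typing import Dict, Iterable

# Font encoding: character code -> glyph name (only the entries from the text).
ENCODING: Dict[int, str] = {46: "period", 48: "zero", 65: "A"}

# Glyph name -> Unicode character (a tiny stand-in for a full glyph list).
GLYPH_TO_UNICODE: Dict[str, str] = {"period": ".", "zero": "0", "A": "A"}


def decode_codes(
    codes: Iterable[int],
    encoding: Dict[int, str],
    glyph_to_unicode: Dict[str, str],
) -> str:
    """Map character codes to glyph names, then glyph names to Unicode.

    Codes or glyph names with no known mapping become U+FFFD, the Unicode
    replacement character.
    """
    chars = []
    for code in codes:
        glyph_name = encoding.get(code)  # step 1: code -> glyph name
        chars.append(glyph_to_unicode.get(glyph_name, "\ufffd"))  # step 2
    return "".join(chars)


text = decode_codes([65, 46, 48], ENCODING, GLYPH_TO_UNICODE)

# Final step: convert to the requested output encoding, UTF-8 in this case.
output_bytes = text.encode("utf-8")
```

Both steps have to succeed for a character to come out correctly: a code missing from the encoding, or a glyph name missing from the glyph list, leaves the extractor with nothing to output.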
There's no rule that requires use of a standard encoding (like ASCII), nor is there any rule that requires use of standard glyph names (like 'period', 'zero', and 'A'). If a font contained a glyph named 'Alice' for the letter 'T', a glyph named 'Bob' for the letter 'h', and a glyph named 'Charlie' for the letter 'e', and the font's encoding mapped code 97 to 'Alice', code 14 to 'Bob', and code 53 to 'Charlie', then a string containing the code sequence (97, 14, 53) would generate the word 'The' on the screen or printer.
Why would PDF generation software do something that crazy? In general, it wouldn't. But it can do something almost as bad when it creates font subsets. (A font subset is a font that contains only the glyphs actually used in the document - this makes the PDF file smaller, and also makes it harder to pirate the font.) For example, if a PDF only used 'T', 'h', and 'e' in a particular font, the PDF generator might create a subset font containing just those three glyphs. And it might rename the glyphs 'p01', 'p02', and 'p03', and encode them as codes 1, 2, and 3. In this kind of situation, there is no way to get the text back out of the PDF file, short of OCRing it. Type 3 fonts are also notoriously bad about providing useful information in their encodings.
Figuring it out
Fortunately, it's more common for font subsets to leave enough information around for a clever text extractor to use. For example, the glyph names often contain the original character codes: the ASCII code for 'T' is 84, and the font subset might use 'p84' as the glyph name for this character.
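A minimal sketch of that heuristic in Python follows. It assumes glyph names made of letters followed by a decimal character code, as in the 'p84' example; the function name guess_char_from_glyph_name and the restriction to printable ASCII are assumptions made for illustration, and real subset fonts use many other naming schemes.

```python
import re
from typing import Optional

# Letters followed by a decimal number, e.g. "p84". This pattern is an assumption.
_PREFIXED_NUMBER = re.compile(r"^[A-Za-z]+(\d+)$")


def guess_char_from_glyph_name(name: str) -> Optional[str]:
    """Guess the character behind a subset glyph name such as 'p84'.

    Returns None when the name does not fit the pattern, or when the number
    falls outside printable ASCII and so is probably just a sequence number
    rather than an original character code.
    """
    match = _PREFIXED_NUMBER.match(name)
    if match is None:
        return None
    code = int(match.group(1))
    if 32 <= code <= 126:
        return chr(code)
    return None


# Glyph names that keep the original ASCII codes can be decoded.
recovered = "".join(
    guess_char_from_glyph_name(name) or "?" for name in ["p84", "p104", "p101"]
)

# A name like 'p01' carries no usable character code, so the guess fails.
unrecoverable = guess_char_from_glyph_name("p01")
```

The printable-range check is what separates the two cases: 'p84' decodes to 'T', while 'p01' is rejected instead of being turned into a control character.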
Additionally, newer PDF files provide 'ToUnicode' tables for their fonts. These map character codes straight to Unicode, avoiding all of the problems with encodings and glyph names.
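To give an idea of what such a table looks like, the Python sketch below parses the simplest kind of ToUnicode entry: 'bfchar' lines, which pair a hexadecimal character code with a hexadecimal UTF-16BE string. The sample table is invented for this example and maps codes 1, 2, and 3 to 'T', 'h', and 'e'. The function names are hypothetical, and range entries ('bfrange') and the rest of the CMap syntax are deliberately left out.

```python
import re
from typing import Dict, Iterable

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
_HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

# Invented sample: a fragment of a ToUnicode table for a three-glyph subset font.
SAMPLE_TOUNICODE = """
3 beginbfchar
<01> <0054>
<02> <0068>
<03> <0065>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Collect code -> Unicode string mappings from bfchar sections."""
    mapping: Dict[int, str] = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src_hex, dst_hex in _HEX_PAIR.findall(block):
            # The destination is a UTF-16BE string written in hex.
            mapping[int(src_hex, 16)] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return mapping


def decode_with_tounicode(codes: Iterable[int], mapping: Dict[int, str]) -> str:
    """Translate character codes directly to Unicode, skipping glyph names."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


to_unicode = parse_bfchar(SAMPLE_TOUNICODE)
word = decode_with_tounicode([1, 2, 3], to_unicode)
```

With a table like this, even a subset font with meaningless glyph names and arbitrary codes can be extracted correctly, because the glyph names are never consulted.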
The free pdftotext tool, part of the Xpdf package, extracts plain text from PDF files. It can generate various text encodings, including plain ASCII, 8-bit Latin1, UTF-8 Unicode, and various standard CJK encodings, and can be configured to generate custom encodings. If you want to avoid the overhead of running a separate executable, the XpdfText® library provides the same functionality.
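For readers who want to call pdftotext from a script, here is a minimal Python sketch. It assumes the pdftotext executable is installed and on the PATH, that '-enc' selects the output encoding, and that '-' as the output file name sends the text to standard output; check the man page for your version before relying on these. The function names are hypothetical.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, encoding: str = "UTF-8") -> List[str]:
    """Build the argument list for a pdftotext run that writes to stdout.

    Assumes the '-enc' option and the '-' output file name behave as
    documented in the pdftotext man page.
    """
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on a file and return the extracted text as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the executable is not installed.
    """
    command = build_pdftotext_command(pdf_path, encoding="UTF-8")
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") (a placeholder file name) would return the text of that file. Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.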
The pdffonts tool, also part of Xpdf, will list all of the fonts used by a PDF file, along with their types, whether the fonts are embedded or not, and whether the fonts have ToUnicode mappings or not.
Copyright 2014 Glyph & Cog, LLC