The XpdfText® library/component extracts plain text from PDF files. The PDF file can be on disk or in memory, and likewise, the text can be extracted to memory or directly to disk.
XpdfText can be used in different ways:
- Convert entire PDF files or individual pages to plain text
  - maintaining layout, or
  - converting to "reading order"
- Extract text from a specified rectangle on a page
  - useful for extracting text from forms
- Convert pages into word lists; for each word, you can retrieve:
  - font name and font size
  - text color
  - word position on the page
  - character offset (for highlight files)
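To make the word-list and rectangle ideas concrete, here is a minimal Python sketch. It does not use XpdfText's actual API: the `Word` record, the `words_in_rect` and `rect_text` helpers, the sample coordinates, and the assumption that y grows downward are all hypothetical, chosen only to show how per-word attributes can drive rectangle-based extraction from a form.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical word record (not XpdfText's real data type)."""
    text: str
    font_name: str
    font_size: float
    color: Tuple[float, float, float]  # RGB, each component in 0..1
    x_min: float  # bounding box in page coordinates (assumed: points)
    y_min: float
    x_max: float
    y_max: float
    char_offset: int  # offset of the word's first character in the page text


def words_in_rect(words: List[Word], x_min: float, y_min: float,
                  x_max: float, y_max: float) -> List[Word]:
    """Return the words whose bounding boxes lie entirely inside the rectangle."""
    return [
        w for w in words
        if w.x_min >= x_min and w.y_min >= y_min
        and w.x_max <= x_max and w.y_max <= y_max
    ]


def rect_text(words: List[Word], x_min: float, y_min: float,
              x_max: float, y_max: float) -> str:
    """Join the words inside the rectangle, top to bottom, then left to right."""
    selected = words_in_rect(words, x_min, y_min, x_max, y_max)
    # Assumption: y grows downward, so a smaller y_min means higher on the page.
    selected.sort(key=lambda w: (w.y_min, w.x_min))
    return " ".join(w.text for w in selected)


# Made-up sample data: two form labels and one filled-in field.
black = (0.0, 0.0, 0.0)
sample_words = [
    Word("Name:", "Helvetica-Bold", 10.0, black, 72, 100, 110, 112, 0),
    Word("Jane", "Courier", 10.0, black, 150, 100, 178, 112, 6),
    Word("Doe", "Courier", 10.0, black, 182, 100, 205, 112, 11),
    Word("Date:", "Helvetica-Bold", 10.0, black, 72, 130, 105, 142, 15),
]

# Extract only the contents of the "Name" field.
field_text = rect_text(sample_words, 140, 95, 300, 115)
print(field_text)
```

The same records could just as easily be filtered by font name or color, which is the kind of post-processing a word list makes possible.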
The extracted text can be converted to any of a wide range of standard encodings, including UTF-8 (Unicode), ISO-8859-1 (Latin-1), 7-bit ASCII, and various other language-specific encodings.
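The practical effect of the output encoding can be seen with a short Python sketch that uses only Python's built-in codecs, not XpdfText itself. The sample string and the `encode_all` helper are made up for illustration, and replacing unmappable characters with `?` is Python's `errors="replace"` behavior; how XpdfText handles such characters is not stated here.

```python
def encode_all(text: str) -> dict:
    """Encode the same text in three encodings, replacing unmappable characters."""
    return {
        "utf-8": text.encode("utf-8"),
        # Latin-1 has no euro sign, so it is replaced with '?'.
        "iso-8859-1": text.encode("iso-8859-1", errors="replace"),
        # 7-bit ASCII has neither accented letters nor the euro sign.
        "ascii": text.encode("ascii", errors="replace"),
    }


sample = "Résumé: 10 €"
encoded = encode_all(sample)

for name, data in encoded.items():
    print(f"{name:>10}: {len(data):2d} bytes  {data!r}")
```

UTF-8 preserves every character at the cost of multi-byte sequences, while the narrower encodings yield one byte per character but lose anything outside their repertoire.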
The XpdfText library also includes all of the functionality of XpdfInfo.
XpdfText is easy to use and is available in several forms:
- Windows: DLL
- Windows: COM component - usable from .NET, Visual Basic, Delphi, etc.
- Mac OS X: shared library
- Linux: shared library
- 32-bit and 64-bit versions available for all platforms
- other platforms: portable C++ source code for the library is available
See also: For content extraction to XML (instead of plain text), try our PDFdeconstruct tool.
Contact Glyph & Cog for more information, including pricing, documentation, and evaluation copies.