pdfGetNumRemovedDupChars
Get the number of removed duplicate chars on the most recent page.
int pdfGetNumRemovedDupChars(PDFHandle pdf)
This function returns the number of removed duplicate characters on
the most recently converted page or region, i.e., the last page from
the last call to pdfConvertToTextFile, pdfConvertToTextString,
pdfExtractTextFromRect, pdfExtractTextFromRect2, pdfBuildWordList, or
pdfBuildWordListFromRect2.
This function, along with pdfGetNumVisibleChars and
pdfGetNumInvisibleChars, is useful for detecting problematic scanned
pages. In "electronic" (non-scanned) PDF files, all of the text will
be visible, and there will be zero invisible characters. In most
cases, removed duplicate characters occur in "fake boldface" text, and
the number of removed duplicates is small. Invisible characters are
used in scanned PDF files, where invisible OCR text is overlaid on top
of the scanned image. If an electronic PDF file is OCRed, it can end
up with both visible and invisible characters.
C:
int nVis, nInvis, nDup;
nVis = pdfGetNumVisibleChars(pdf);
nInvis = pdfGetNumInvisibleChars(pdf);
nDup = pdfGetNumRemovedDupChars(pdf);
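
The three counts can be combined into a rough per-page check. The following is a minimal sketch, not part of the library: the PageKind enum, classifyPage, and hasManyDuplicates are hypothetical helpers, and the 10% duplicate threshold is an arbitrary assumption that would likely need tuning for real documents.

```c
/* Hypothetical page categories; these names are not part of the library. */
typedef enum {
    PAGE_EMPTY,       /* no visible or invisible characters */
    PAGE_ELECTRONIC,  /* visible text only */
    PAGE_SCANNED_OCR, /* invisible (OCR) text only */
    PAGE_MIXED        /* both visible and invisible text */
} PageKind;

/* Classify a page from the visible and invisible character counts. */
static PageKind classifyPage(int nVis, int nInvis)
{
    if (nVis <= 0 && nInvis <= 0) {
        return PAGE_EMPTY;
    }
    if (nInvis <= 0) {
        return PAGE_ELECTRONIC;
    }
    if (nVis <= 0) {
        return PAGE_SCANNED_OCR;
    }
    return PAGE_MIXED;
}

/* Return 1 if the removed-duplicate count looks unusually large.
   Assumption: "large" means more than 10% of all characters on the page. */
static int hasManyDuplicates(int nVis, int nInvis, int nDup)
{
    int total = nVis + nInvis;
    return nDup > 0 && nDup > total / 10;
}
```

With the counts from the example above, classifyPage(nVis, nInvis) returning PAGE_MIXED would suggest an electronic page that was later OCRed, and a nonzero result from hasManyDuplicates(nVis, nInvis, nDup) would suggest more duplication than fake boldface alone usually produces.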