getNumRemovedDupChars
Get the number of removed duplicate chars on the most recent page.
getNumRemovedDupChars([out, retval] int *n)
This function returns the number of removed duplicate characters on
the most recently converted page or region, i.e., the last page from
the last call to convertToTextFile, convertToTextString,
extractTextFromRect, extractTextFromRect2, buildWordList, or
buildWordListFromRect2.
This function, along with getNumVisibleChars and
getNumInvisibleChars, is useful for detecting problematic scanned
pages. In "electronic" (non-scanned) PDF files, all of the text will
be visible, and there will be zero invisible characters. Invisible
characters are used in scanned PDF files, where invisible OCR text is
overlaid on top of the scanned image. If an electronic PDF file is
OCRed, it can end up with both visible and invisible characters. In
most cases, removed duplicate characters occur in "fake boldface"
text (the same text drawn more than once at slightly offset
positions), and the number of removed duplicates is small.
VB:
nVis = pdf.getNumVisibleChars()
nInvis = pdf.getNumInvisibleChars()
nDup = pdf.getNumRemovedDupChars()
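
One way to act on these three counts is sketched below in Python, as a
language-neutral illustration rather than part of the component's API.
The helper name classify_page, the category labels, and the
dup_ratio_limit threshold are all assumptions made for this example;
only the rules themselves (zero invisible characters for electronic
pages, invisible characters for OCRed scans, both for an OCRed
electronic file, normally few duplicates) come from the description
above.

```python
def classify_page(n_vis, n_invis, n_dup, dup_ratio_limit=0.25):
    """Classify one page from its character counts.

    n_vis, n_invis, n_dup are the values obtained from
    getNumVisibleChars, getNumInvisibleChars and getNumRemovedDupChars.
    dup_ratio_limit is a hypothetical threshold, not a documented value.
    Returns (kind, many_dups).
    """
    if min(n_vis, n_invis, n_dup) < 0:
        raise ValueError("character counts must be non-negative")

    if n_vis == 0 and n_invis == 0:
        kind = "no text"            # blank page, or a scan with no OCR layer
    elif n_invis == 0:
        kind = "electronic"         # all text visible
    elif n_vis == 0:
        kind = "scanned with OCR"   # only the invisible OCR overlay
    else:
        kind = "mixed"              # e.g. an electronic file that was OCRed

    # Duplicates are normally few (fake boldface), so flag a page where
    # they are large relative to the text that was kept (assumed rule).
    total = n_vis + n_invis
    many_dups = total > 0 and n_dup > dup_ratio_limit * total
    return kind, many_dups
```

A page reported as "mixed", or one with many_dups set, is a reasonable
candidate for manual inspection; the threshold should be tuned against
real documents.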