Apache PDFBox: problems with encoding

Question

This answer is actually an explanation why a generic solution for your task is at least very complicated if not impossible. Under benign circumstances, i.e. for PDFs subject to specific restrictions, code like yours can be successfully used, but your example PDF shows that the PDFs you apparently want to manipulate are not restricted like that.

Why automatic replacement of text is difficult/impossible

There are a number of factors that impede automatic replacement of text in PDFs. Some already make it difficult to find the instructions that draw the text in question, and some complicate replacing the characters in the arguments of those instructions.

The list of problems illustrated here is not exhaustive!

Finding instructions drawing a specific text

PDFs contain content streams which contain sequences of instructions telling a PDF processor where to draw what. Regular text in PDFs is drawn by instructions setting the current font (and font size), setting the position to draw the text at, and actually drawing text. This can be as easy to understand and search for as this:

/TT0 1 Tf
9 0 0 9 5 5 Tm
(file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]) Tj

(Here the font TT0 with size 1 is selected, then an affine transformation is applied to scale text by a factor of 9 and move to the position (5, 5), and finally the text “file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]” is drawn.)

In such a case, finding the instructions responsible for drawing a given piece of text is easy. But the instructions in question may also look different.

Split lines

For example, the string may be drawn in pieces. Instead of the Tj instruction above, we may have

[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])] TJ

(Here first “file:///C/Users/Mi/Downloads/converted.txt” is drawn, then the text drawing position is slightly moved, then “[10.03.2020 18:43:57]” is drawn, both in the same TJ instruction.)
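To get a feeling for what a search has to do with such a TJ argument, here is a minimal sketch that collects the string pieces of the array above and joins them. The class and method names are made up for illustration, and the regular expression is a simplifying assumption: it ignores hex strings, nested balanced parentheses, and escape sequences, all of which real PDF strings may contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TjArrayText {

    // Matches a literal string "( ... )" without nested parentheses.
    // Escaped characters (backslash + any character) are kept undecoded.
    private static final Pattern LITERAL =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)");

    /** Joins all literal strings found in the operands, dropping the numeric adjustments. */
    static String joinStrings(String operands) {
        StringBuilder joined = new StringBuilder();
        Matcher matcher = LITERAL.matcher(operands);
        while (matcher.find()) {
            joined.append(matcher.group(1));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String tj = "[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])] TJ";
        System.out.println(joinStrings(tj));
    }
}
```

Note that dropping the numbers is only safe for small adjustments like the 2 here; large values are often used by PDF producers in place of space characters.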

Or you may see

(file:///C/Users/Mi/Downloads/converted.txt) Tj
([10.03.2020 18:43:57]) Tj

(The text parts drawn in different instructions.)

Also the order of text pieces may be unexpected:

([10.03.2020 18:43:57]) Tj 
-40 0 Td
(file:///C/Users/Mi/Downloads/converted.txt) Tj

(First the date string is drawn, then the text position is moved left to well before the drawn date, then the URL is drawn.)

Some PDF producers draw each character separately, setting the whole text transformation in between:

9 0 0 9 5 5 Tm
(f) Tj
9 0 0 9 14 5 Tm
(i) Tj
9 0 0 9 23 5 Tm
(l) Tj
...

These instructions also need not appear in sequence as they do here. They can be spread over the whole stream, or even over multiple streams: a page can have an array of content streams instead of a single one, and part of the string may be drawn in the content stream of a sub-object referenced from the page content stream.

Thus, for finding the instructions responsible for a specific, multi-character text, you may have to inspect multiple streams and glue the strings you found together according to the position they have been drawn at.
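The gluing step can be sketched as follows. This is only an illustration under strong assumptions: the Fragment type and the glue method are hypothetical, the text is horizontal, and all fragments of a line share exactly the same y coordinate (real code needs a tolerance and has to detect gaps that represent spaces).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TextFragmentGluer {

    /** A piece of text with the position it was drawn at (hypothetical helper type). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Sorts fragments top to bottom, then left to right, and concatenates their text. */
    static String glue(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> -f.y) // PDF y axis points up: larger y first
                .thenComparingDouble(f -> f.x));
        StringBuilder result = new StringBuilder();
        for (Fragment f : sorted) {
            result.append(f.text);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The single-character instructions from above, found in arbitrary order.
        List<Fragment> found = new ArrayList<>();
        found.add(new Fragment(23, 5, "l"));
        found.add(new Fragment(5, 5, "f"));
        found.add(new Fragment(14, 5, "i"));
        System.out.println(glue(found));
    }
}
```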

Ligatures

A single character code does not necessarily correspond to a single character of your search string. There are a number of special glyphs for combinations of characters, like ﬂ for fl. So for searching, one has to expand such ligatures.
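If the extracted text contains the Unicode ligature code points (such as U+FB02 for ﬂ), one possible way to expand them is Unicode compatibility normalization from the JDK. The class and method names below are made up for this sketch. Be aware that NFKC also changes other characters (e.g. superscripts or full-width forms), so it may be more aggressive than you want.

```java
import java.text.Normalizer;

class LigatureExpander {

    /** Expands compatibility characters such as the "fl" ligature into plain letters. */
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    /** Searches in the expanded form of the extracted text. */
    static boolean containsExpanded(String extracted, String search) {
        return expand(extracted).contains(expand(search));
    }

    public static void main(String[] args) {
        String extracted = "over\uFB02ow"; // "overflow" drawn with the fl ligature glyph
        System.out.println(extracted.contains("flow"));          // plain search misses it
        System.out.println(containsExpanded(extracted, "flow")); // expanded search finds it
    }
}
```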

Encodings

In the examples above, the characters of the text were easy to recognize even if the text was not drawn in a single run. But in PDFs the encoding of the characters need not be so obvious. Actually, each font may come with its own encoding, e.g.

<004B0048004F004F0052000400040004>Tj

can draw “hello!!!”.

(Here the string argument is written as hex string, in the debugger you saw “KHOOR…”.)

Thus, for searching text, one needs to first map the string arguments of text drawing instructions to Unicode depending on the specific encoding of the current font.
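As an illustration of that mapping step, the following sketch decodes the hex string from above. It assumes two-byte character codes and a hand-written code-to-Unicode table that I derived merely from the fact that the string draws “hello!!!”; in a real PDF that table has to come from the font's encoding information, where present. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class HexStringDecoder {

    /** Decodes a hex string of two-byte codes using the given code-to-Unicode table. */
    static String decode(String hex, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            String mapped = toUnicode.get(code);
            // Codes without a mapping become the Unicode replacement character.
            text.append(mapped != null ? mapped : "\uFFFD");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Illustrative table for this one font only (an assumption, not read from a PDF).
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x004B, "h");
        toUnicode.put(0x0048, "e");
        toUnicode.put(0x004F, "l");
        toUnicode.put(0x0052, "o");
        toUnicode.put(0x0004, "!");
        System.out.println(decode("004B0048004F004F0052000400040004", toUnicode));
    }
}
```

Reading the same codes directly as characters gives K, H, O, O, R followed by control characters, which is what showed up in the debugger.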

But the PDF does not need to contain a mapping from the individual codes to Unicode characters; there may only be a mapping to the glyph id in the font file. In the case of embedded font files, these font files then don’t need to contain any mapping to Unicode characters either.

Often PDF files do contain information on the Unicode characters matching the codes, to allow text extraction, e.g. for copy/paste. Strictly speaking, though, such information is optional. Even worse, it may contain errors without causing any issues when the PDF is displayed. In all such situations one has to use OCR-like mechanisms to recognize the Unicode characters associated with each glyph.

Replacing text in instructions

Once you have found the instructions responsible for drawing the text you searched for, you have to replace the text. This step comes with problems of its own.

Subset fonts

If font files are embedded in a PDF, they are often embedded only as subsets of the original fonts to save space. E.g. in your example PDF the font Tahoma used to display “hello!!!” is embedded only with the following glyphs:

Even Times New Roman (the font used for the text you could recognize) is only subset embedded with the following glyphs:

Thus, even if you found the “hello!!!” in Tahoma, simply replacing the character codes to mean “byebye??” would only display “ e e ”, as the only character for which a glyph is present in the embedded font is the ‘e’.

Thus, to replace the text you may either have to edit the embedded font file and the PDF font object representing it so that they contain and encode all required glyphs, or add another font plus instructions to switch to that font for the manipulated text drawing instructions and back again thereafter.
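Before touching any font, you can at least check which characters of the replacement are missing from the subset. The sketch below assumes, purely for illustration, that the subset holds just the glyphs needed for “hello!!!”; the class and method names are made up, and only characters from the Basic Multilingual Plane are handled.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class SubsetGlyphCheck {

    /** Returns the set of distinct characters of the given text. */
    static Set<Character> glyphsOf(String text) {
        Set<Character> glyphs = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            glyphs.add(c);
        }
        return glyphs;
    }

    /** Returns the distinct characters of the replacement that have no glyph, in order. */
    static String missing(String replacement, Set<Character> available) {
        StringBuilder missing = new StringBuilder();
        for (char c : glyphsOf(replacement)) {
            if (!available.contains(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    /** Rough preview: characters without a glyph are shown as blanks. */
    static String preview(String replacement, Set<Character> available) {
        StringBuilder shown = new StringBuilder();
        for (char c : replacement.toCharArray()) {
            shown.append(available.contains(c) ? c : ' ');
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        Set<Character> subset = glyphsOf("hello!!!"); // assumed subset content
        System.out.println("missing: " + missing("byebye??", subset));
        System.out.println("preview: [" + preview("byebye??", subset) + "]");
    }
}
```

What a viewer actually shows for a missing glyph (nothing, a blank, or a placeholder box) depends on the font and the viewer; the blanks here are just a stand-in.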

Font encodings

Even if your font is not embedded at all (so your complete local copy of the font will be used) or embedded with all the glyphs you need, the encoding used for your font may be limited. In Western European language based PDFs you will often find WinAnsiEncoding, an encoding similar to Windows code page 1252. If you want to replace with Cyrillic text, there are no character codes for those characters.

Thus, in this case you might have to change the encoding to include all the characters you need (by scanning all uses of the font in question to find unused character codes in the present encoding) or add another font with a more suitable encoding.
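For a quick plausibility check you can test your replacement text against Windows code page 1252 using the JDK charset support. This is only an approximation, as WinAnsiEncoding is similar to that code page but not identical, and the font in your PDF may use a different or customized encoding. The class and method names are made up for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class WinAnsiCheck {

    /** Returns true if every character of the text has a code in windows-1252. */
    static boolean fitsCp1252(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsCp1252("byebye??"));
        // A Cyrillic word, written with escapes to be independent of the source file encoding.
        System.out.println(fitsCp1252("\u043F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```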

Layout considerations

If your replacement text is longer or shorter than the replaced text and there is other text following on the same line in the PDF, you have to decide whether that text should be moved, too, or not. It may belong together and has to be shifted accordingly, but it may alternatively be from a separate text block or column in which case it should not be moved.

Text justification may also be damaged.

Also consider marked text (underline / strike through / background color / …). These markings in PDF (usually) are not font properties but separate vector graphics. To get these right, you have to parse the vector graphics and annotations from the page, heuristically identify text markings, and update them.

Tagged PDFs

If you deal with tagged PDFs (e.g. for accessibility), this may make finding text easier (as accessibility should allow for easy text extraction) but replacing text harder because you may also have to update some tags or structure tree data.

How to implement a generic text replacement nonetheless

As shown above, there are a lot of hindrances to text replacement in PDFs. Thus, a complete solution (where possible at all) is far beyond the scope of a Stack Overflow answer. Some pointers, though:

To find the text to replace, you should make use of the PDFTextStripper (a PDFBox utility class for text extraction) and extend it so that you get all the text along with pointers to the text drawing instruction that draws each character. This way you don’t have to implement all the decoding and sorting of the text yourself.

To replace the text, you can ask the PDFBox font classes (provided by the PDFTextStripper if extended accordingly) whether they can encode your replacement text.

And always have a copy of the PDF specification (ISO 32000-1 or ISO 32000-2) at hand.

But do be aware that it will take you a while, probably a number of weeks or months, to get a somewhat decent generic solution.