Watermarking with PDFBox

UPDATED ANSWER (a better version with an easier way to watermark, thanks to the commenters below and to @okok, who provided input with his answer): If you are using PDFBox 1.8.10 or above, you can easily add a watermark to your PDF document, with better control over which pages are watermarked. Assuming you have a one page …

PDFBox : PDPageContentStream’s append mode misbehaving

Use the constructor that takes a fifth parameter, resetContext, to reset the graphics context:
public PDPageContentStream(PDDocument document, PDPage sourcePage, boolean appendContent, boolean compress, boolean resetContext) throws IOException
Alternatively, save and restore the graphics state in the first content stream by calling saveGraphicsState(); // … restoreGraphicsState();

remove invisible text from pdf using pdfbox

The invisible text in the OP’s sample PDF is mostly made invisible by defining clip paths (outside the bounds of which the text lies) and by filling paths (hiding the text underneath). Thus, we have to consider path-related instructions during text extraction to ignore that invisible text. Unfortunately callbacks designed for these instructions …

Writing Arabic with PDFBOX with correct characters presentation form without being separated

Notice: The sample code in this answer might be outdated; please refer to h q’s answer for working sample code. First, I would like to thank Tilman Hausherr and M.Prokhorov for showing me the library that made writing Arabic with Apache PDFBox possible. This answer is divided into two sections: Downloading the library and …

Is it possible to justify text in PDFBOX?

This older answer shows how to break a string into substrings fitting into a given width. To make the sample code there draw the substrings so that they fill the whole line width, replace as follows (depending on the PDFBox version): PDFBox 1.8.x: Replace the final loop for (String line: lines) { contentStream.drawString(line); contentStream.moveTextPositionByAmount(0, …
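The replacement code is cut off in this excerpt, so the following is only a minimal sketch of one common way to make a line fill a given width: spread the free space evenly between characters, which corresponds to the PDF character-spacing operator (Tc). The class name JustifySketch, the method charSpacing, and the fixed glyph width in main are hypothetical. With PDFBox, the natural width would typically come from the font metrics (for example font.getStringWidth(line) / 1000 * fontSize) rather than from a constant.

```java
final class JustifySketch {

    /**
     * Returns the extra spacing to add after each character so that a line
     * with the given natural width stretches to targetWidth.
     * Both widths must use the same unit (e.g. PDF user space units).
     */
    static double charSpacing(String line, double naturalWidth, double targetWidth) {
        if (line.length() < 2) {
            return 0.0; // a single character has no gaps to stretch
        }
        double free = targetWidth - naturalWidth;
        if (free <= 0) {
            return 0.0; // the line already fills (or overflows) the width
        }
        // There are (length - 1) gaps between the characters of the line.
        return free / (line.length() - 1);
    }

    public static void main(String[] args) {
        // Assumption for illustration: every glyph is 6 units wide.
        String line = "Hello PDF";
        double naturalWidth = line.length() * 6.0;
        System.out.println(charSpacing(line, naturalWidth, 70.0));
    }
}
```

The last line of a paragraph is usually left unjustified, so a caller would apply the computed spacing to every line except the final one.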

Text coordinates when stripping from PDFBox

This is just another case of the excessive PdfTextStripper coordinate normalization. Just like you, I had thought that by using TextPosition.getTextMatrix() (instead of getX() and getY()) one would get the actual coordinates, but no, even these matrix values have to be corrected (at least in PDFBox 2.0.x; I haven’t checked 1.8.x) because the matrix is …

extract images from pdf using pdfbox

Here is code using PDFBox 2.0.1 that will get a list of all images from the PDF. Unlike the other code, it recurses through the document instead of only trying to get the images from the top level. public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException { List<RenderedImage> images = new ArrayList<>(); …

PdfBox encode symbol currency euro

Unfortunately PDFBox’s string encoding is far from perfect yet (as of version 1.8.x). It uses the same routines when encoding strings in generic PDF objects as when encoding strings in content streams, which is fundamentally wrong. Thus, instead of using PDPageContentStream.drawString (which uses that wrong encoding), you have to translate to the correct encoding yourself. E.g. …
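The example in the answer is cut off here, so the following is not the author's code but a minimal sketch of what "translating to the correct encoding yourself" can look like. It assumes the font uses WinAnsiEncoding, which places the euro sign at byte 0x80 just as Windows-1252 does, and it relies on the JDK providing the windows-1252 charset. The class name WinAnsiSketch and the method toPdfHexString are hypothetical. The result is a PDF hex string that could be written to a content stream in front of a text-showing operator.

```java
import java.nio.charset.Charset;

final class WinAnsiSketch {

    /**
     * Encodes text as Windows-1252 bytes and wraps them as a PDF hex string,
     * e.g. "<80>" for the euro sign.
     * Assumption: the target font uses WinAnsiEncoding.
     */
    static String toPdfHexString(String text) {
        byte[] bytes = text.getBytes(Charset.forName("windows-1252"));
        StringBuilder hex = new StringBuilder("<");
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.append('>').toString();
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign; prints <80203130>
        System.out.println(toPdfHexString("\u20AC 10"));
    }
}
```

Windows-1252 and WinAnsiEncoding differ for a few rarely used codes, so this shortcut is only reasonable for characters, such as the euro sign, on which the two agree.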