Text coordinates when stripping from PDFBox

This is just another case of the excessive PDFTextStripper coordinate normalization. Like you, I had thought that using TextPosition.getTextMatrix() (instead of getX() and getY()) would give the actual coordinates, but no: even these matrix values have to be corrected (at least in PDFBox 2.0.x; I haven’t checked 1.8.x), because the matrix is multiplied by a translation that makes the lower left corner of the crop box the origin.

Thus, in your case (in which the lower left of the crop box is not the origin), you have to correct the values, e.g. by replacing

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();

by

        PDRectangle cropBox = doc.getPage(0).getCropBox();

        float x = minx + cropBox.getLowerLeftX();
        float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();

Instead of

[screenshot: without correction]

you now get

[screenshot: with x,y correction]
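
To make the arithmetic concrete, here is a minimal plain-Java sketch of the same correction without any PDFBox dependency. The class CropBoxOrigin, its method name, and all numbers are made up for illustration; in real code the offsets would come from cropBox.getLowerLeftX() and cropBox.getLowerLeftY() as shown above.

```java
final class CropBoxOrigin {
    // Shift a coordinate reported relative to the crop box's lower left corner
    // back into the page's default user space.
    static float toUserSpace(float stripped, float cropLowerLeft) {
        return stripped + cropLowerLeft;
    }

    public static void main(String[] args) {
        // Hypothetical crop box whose lower left corner is at (36, 72)
        float cropLowerLeftX = 36f;
        float cropLowerLeftY = 72f;

        // Hypothetical values as reported by the text stripper
        float strippedX = 100f;
        float strippedY = 500f;

        System.out.println(toUserSpace(strippedX, cropLowerLeftX)); // 136.0
        System.out.println(toUserSpace(strippedY, cropLowerLeftY)); // 572.0
    }
}
```

If the crop box starts at the origin, both offsets are zero and the values pass through unchanged, which is why the problem only shows up for some documents.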

Obviously, though, you will also have to correct the height somewhat. This is due to the way the PDFTextStripper determines the text height:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;

(from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PDFTextStripper)

While the font bounding box indeed usually is too large, half of it often is not enough.
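
To see why half the bounding box can come up short, here is a small plain-Java sketch of the arithmetic. The class and method names are hypothetical, and the metrics are assumed to resemble those of the standard Helvetica AFM (font bounding box running from -225 to 931 vertically, cap height 718), so treat them as illustrative rather than authoritative. The division by 1000 assumes the usual glyph space of 1000 units per em.

```java
final class GlyphHeightEstimate {
    // The heuristic quoted above: half the font bounding box height,
    // expressed in glyph space units.
    static float halfBBoxHeight(float bboxLowerY, float bboxUpperY) {
        return (bboxUpperY - bboxLowerY) / 2;
    }

    // Scale glyph space units to text space units, assuming 1000 units per em.
    static float scale(float glyphUnits, float fontSize) {
        return glyphUnits * fontSize / 1000f;
    }

    public static void main(String[] args) {
        float half = halfBBoxHeight(-225f, 931f); // 578 glyph space units
        float capHeight = 718f;                   // assumed cap height
        float fontSize = 12f;                     // assumed font size

        System.out.println("half bbox:  " + scale(half, fontSize));
        System.out.println("cap height: " + scale(capHeight, fontSize));
        System.out.println("short by:   " + scale(capHeight - half, fontSize));
    }
}
```

With metrics like these, a box drawn at half the bounding box height would stop short of the top of the capital letters, which fits the observation above.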
