Retrieve the respective coordinates of all words on the page with itextsharp

(I’m mostly working with the Java library iText, not with the .Net library iTextSharp, so please excuse some Java-isms here; everything should be easy to translate.)

To extract the contents of a page using iText(Sharp), you use the classes in the parser package, which preprocess the page content and feed it to a RenderListener of your choice.

In a context in which you are only interested in the text, you most commonly use a TextExtractionStrategy which is derived from RenderListener and adds a single method getResultantText to retrieve the aggregated text from the page.

As the initial intent of text parsing in iText was to implement this use case, most existing RenderListener samples are TextExtractionStrategy implementations and only make the text available.

Therefore, you will have to implement your own RenderListener, which you already seem to have christened TextWithPositionExtractionStategy.

Just like there is both a SimpleTextExtractionStrategy (which is implemented with some assumptions about the structure of the page content operators) and a LocationTextExtractionStrategy (which does not have the same assumptions but is somewhat more complicated), you might want to start with an implementation that makes some assumptions.

Thus, just as SimpleTextExtractionStrategy does, your first, simple implementation expects the text rendering events forwarded to your listener to arrive line by line, and within a line from left to right. This way, as soon as you find a horizontal gap or a punctuation mark, you know your current word is finished and you can process it.
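The two checks this relies on (new line, horizontal gap) can be sketched in plain Java without any iText types. This is only an illustration modeled on how I remember SimpleTextExtractionStrategy working: the class name LayoutChecks, the float[] {x, y} point representation and both thresholds are assumptions. In a real listener the points would come from the baseline of the previous and the new TextRenderInfo, and the space width from getSingleSpaceWidth().

```java
/** Plain-Java sketch of the two layout checks; points are {x, y} pairs in user space. */
final class LayoutChecks {
    // Assumed threshold: squared distance from the previous baseline above which
    // the new text is considered to start a new line.
    static final float SAME_LINE_THRESHOLD = 1f;

    private LayoutChecks() {}

    /** Squared distance of point p from the line through a and b. */
    static float squaredDistanceFromLine(float[] a, float[] b, float[] p) {
        float dx = b[0] - a[0];
        float dy = b[1] - a[1];
        float lengthSquared = dx * dx + dy * dy;
        if (lengthSquared == 0f) {
            // Degenerate baseline: fall back to the distance between the two points.
            float px = p[0] - a[0];
            float py = p[1] - a[1];
            return px * px + py * py;
        }
        float cross = dx * (a[1] - p[1]) - dy * (a[0] - p[0]);
        return cross * cross / lengthSquared;
    }

    /** True if newStart does not lie on the baseline running from lastStart to lastEnd. */
    static boolean isNewLine(float[] lastStart, float[] lastEnd, float[] newStart) {
        return squaredDistanceFromLine(lastStart, lastEnd, newStart) > SAME_LINE_THRESHOLD;
    }

    /** True if the new text starts more than half a space width away from the last end. */
    static boolean isGapFromPrevious(float[] lastEnd, float[] newStart, float singleSpaceWidth) {
        float dx = newStart[0] - lastEnd[0];
        float dy = newStart[1] - lastEnd[1];
        return Math.sqrt(dx * dx + dy * dy) > singleSpaceWidth / 2f;
    }
}
```

Both thresholds are guesses that you will probably have to tune for your documents, in particular for tightly kerned or justified text.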

In contrast to the text extraction strategies, you don’t need a StringBuffer member to collect your result but instead a list of some “word with position” structure. Furthermore, you need some member variable to hold the TextRenderInfo events you have already collected for this page but could not yet finally process (a single word may arrive in several separate events).

As soon as you (i.e. your renderText method) are called for a new TextRenderInfo object, you should operate like this (pseudo-code):

if (unprocessedTextRenderInfos not empty)
{
    if (isNewLine // Check this like the simple text extraction strategy checks for hardReturn
     || isGapFromPrevious) // Check this like the simple text extraction strategy checks whether to insert a space
    {
        process(unprocessedTextRenderInfos);
        unprocessedTextRenderInfos.clear();
    }
}

split new TextRenderInfo using its getCharacterRenderInfos() method;
while (characterRenderInfos contain word end)
{
    add characterRenderInfos up to excluding the white space/punctuation to unprocessedTextRenderInfos;
    process(unprocessedTextRenderInfos);
    unprocessedTextRenderInfos.clear();
    remove used render infos from characterRenderInfos;
}
add remaining characterRenderInfos to unprocessedTextRenderInfos;
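For illustration, here is the same control flow as a small, self-contained Java sketch that does not depend on iText. Glyph stands in for one character-level render info and Word for the “word with position” structure; both, as well as WordCollector, the two thresholds and the rule that every non-alphanumeric character ends a word, are assumptions of mine. In a real listener the glyphs would come from getCharacterRenderInfos(), and the new line and gap checks are simplified here to horizontal text.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for one character-level render info (hypothetical, not an iText type). */
final class Glyph {
    final char ch;
    final float startX, endX, baselineY;
    Glyph(char ch, float startX, float endX, float baselineY) {
        this.ch = ch; this.startX = startX; this.endX = endX; this.baselineY = baselineY;
    }
}

/** Hypothetical "word with position" structure holding the start coordinates. */
final class Word {
    final String text;
    final float x, y;
    Word(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
}

class WordCollector {
    private static final float GAP = 1.0f;            // assumed gap threshold
    private static final float LINE_TOLERANCE = 0.5f; // assumed baseline tolerance

    private final List<Glyph> pending = new ArrayList<>(); // unprocessed render infos
    private final List<Word> words = new ArrayList<>();    // result list

    /** Call once per text rendering event, already split into glyphs. */
    void renderText(List<Glyph> chunk) {
        if (chunk.isEmpty()) return;
        if (!pending.isEmpty()) {
            Glyph last = pending.get(pending.size() - 1);
            Glyph first = chunk.get(0);
            boolean newLine = Math.abs(first.baselineY - last.baselineY) > LINE_TOLERANCE;
            boolean gap = first.startX - last.endX > GAP;
            if (newLine || gap) flush();
        }
        for (Glyph g : chunk) {
            if (isWordEnd(g.ch)) flush(); // the separator itself is dropped
            else pending.add(g);
        }
    }

    /** Call when the page (or text block) is finished. */
    void finish() { flush(); }

    List<Word> words() { return words; }

    private static boolean isWordEnd(char c) {
        // Assumption: white space and any punctuation end a word.
        return !Character.isLetterOrDigit(c);
    }

    /** Turns the pending glyphs into one word positioned at the first glyph. */
    private void flush() {
        if (pending.isEmpty()) return;
        StringBuilder text = new StringBuilder();
        for (Glyph g : pending) text.append(g.ch);
        Glyph first = pending.get(0);
        words.add(new Word(text.toString(), first.startX, first.baselineY));
        pending.clear();
    }
}
```

Note that this word-end rule also splits at apostrophes and hyphens; if that is not what you want, isWordEnd is the place to change it.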

In process(unprocessedTextRenderInfos) you extract the information you need from the unprocessedTextRenderInfos; you concatenate the individual text contents to a word and take the coordinates you want; if you merely want starting coordinates, you take those from the first of those unprocessed TextRenderInfos. If you need more data, you also use the data from the other TextRenderInfos. With these data you fill a “word with position” structure and add it to your result list.
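As a rough sketch of such a process step that keeps a bit more than the starting coordinates, the following plain Java takes the left edge from the first collected piece and the right edge from the last one. Piece, PositionedWord and WordBuilder are hypothetical names, and the sketch assumes horizontal text; with iText you would read the corresponding values from the baseline (and, if you need the height, the ascent and descent lines) of each TextRenderInfo.

```java
import java.util.List;

/** Stand-in for one collected render info (hypothetical, not an iText class). */
final class Piece {
    final String text;
    final float startX, endX, baselineY;
    Piece(String text, float startX, float endX, float baselineY) {
        this.text = text; this.startX = startX; this.endX = endX; this.baselineY = baselineY;
    }
}

/** Hypothetical "word with position" structure with a horizontal extent. */
final class PositionedWord {
    final String text;
    final float left, right, baselineY;
    PositionedWord(String text, float left, float right, float baselineY) {
        this.text = text; this.left = left; this.right = right; this.baselineY = baselineY;
    }
}

final class WordBuilder {
    private WordBuilder() {}

    /** Concatenates the pieces of one word and derives its extent from first and last piece. */
    static PositionedWord process(List<Piece> pieces) {
        if (pieces.isEmpty()) {
            throw new IllegalArgumentException("no pieces to process");
        }
        StringBuilder text = new StringBuilder();
        for (Piece p : pieces) {
            text.append(p.text);
        }
        Piece first = pieces.get(0);
        Piece last = pieces.get(pieces.size() - 1);
        return new PositionedWord(text.toString(), first.startX, last.endX, first.baselineY);
    }
}
```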

When page processing is finished, you have to once more call process(unprocessedTextRenderInfos) and unprocessedTextRenderInfos.clear(); alternatively you may do that in the endTextBlock method.

Having done this, you might feel ready to implement the slightly more complex variant which does not have the same assumptions concerning the page content structure. 😉
