How to know if a field is on a particular page?

In PDFBox, content streams are handled per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages.

The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations (widget annotations) on 0, 1, or more actual PDF pages. Furthermore, if there is only 1 visualization, the field object and the visualization object may be merged into a single object.

PDFBox 1.8.x

Unfortunately, PDFBox's PDAcroForm and PDField objects represent only this object structure and do not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.

The following code should make clear how to do that:

@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
    PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();

    List<PDPage> pages = docCatalog.getAllPages();
    Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = pages.get(i);
        for (PDAnnotation annotation : page.getAnnotations())
            pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
    }

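    PDAcroForm acroForm = docCatalog.getAcroForm();
    if (acroForm == null) {
        // getAcroForm() returns null for documents without an interactive form.
        System.out.println("The document contains no AcroForm.");
        return;
    }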

    for (PDField field : (List<PDField>)acroForm.getFields()) {
        COSDictionary fieldDict = field.getDictionary();

        List<Integer> annotationPages = new ArrayList<Integer>();
        List<COSObjectable> kids = field.getKids();
        if (kids != null) {
            for (COSObjectable kid : kids) {
                COSBase kidObject = kid.getCOSObject();
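                if (kidObject instanceof COSDictionary) {
                    // Kids that no page references (e.g. sub-fields of a deeper tree) have no page number.
                    Integer kidPage = pageNrByAnnotDict.get(kidObject);
                    if (kidPage != null)
                        annotationPages.add(kidPage);
                }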
            }
        }

        Integer mergedPage = pageNrByAnnotDict.get(fieldDict);

        if (mergedPage == null)
            if (annotationPages.isEmpty())
                System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
            else
                System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
        else
            if (annotationPages.isEmpty())
                System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
            else
                System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
    }
}

Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:

The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.
Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
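To illustrate the first shortcoming, here is a library-free sketch of how a deep field tree could be walked recursively. It does not use the PDFBox API: FieldNode, widget() and terminalFieldNames are hypothetical names invented for this example, and a node without a partial name is simply assumed to be a pure widget annotation, which is a simplification of the real rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a node of the form's field tree (not a PDFBox class).
// partialName == null models a pure widget annotation, i.e. a kid that is not a field.
final class FieldNode {
    final String partialName;
    final List<FieldNode> kids;

    FieldNode(String partialName, FieldNode... kids) {
        this.partialName = partialName;
        this.kids = Arrays.asList(kids);
    }

    static FieldNode widget() {
        return new FieldNode(null);
    }

    // Collects the fully qualified names of all terminal fields below the given roots.
    static List<String> terminalFieldNames(List<FieldNode> roots) {
        List<String> names = new ArrayList<>();
        for (FieldNode root : roots) {
            collect(root, "", names);
        }
        return names;
    }

    private static void collect(FieldNode node, String parentName, List<String> names) {
        if (node.partialName == null) {
            return; // a widget, not a field
        }
        // Fully qualified name: the partial names along the path, joined by periods.
        String fullName = parentName.isEmpty() ? node.partialName : parentName + "." + node.partialName;
        boolean hasFieldKids = false;
        for (FieldNode kid : node.kids) {
            if (kid.partialName != null) {
                hasFieldKids = true;
                collect(kid, fullName, names); // descend into inner tree nodes
            }
        }
        if (!hasFieldKids) {
            names.add(fullName); // terminal field: no kids, or only widget kids
        }
    }

    public static void main(String[] args) {
        FieldNode a = new FieldNode("a",
                new FieldNode("b", widget(), widget()), // one field, two visualizations
                new FieldNode("c"));                    // a field without any visualization
        FieldNode d = new FieldNode("d");
        System.out.println(terminalFieldNames(Arrays.asList(a, d))); // prints [a.b, a.c, d]
    }
}
```

A loop that looks only at the direct children of the root would stop at "a" and "d" here and never reach "a.b" or "a.c".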

PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. Although fast, this getPage() method has a drawback: it reads the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process has been created by software that fills that entry, all is well, but if the PDF creator has not filled that entry, all you get is a null page.

PDFBox 2.0.x

In 2.0.x some of the methods for accessing the elements in question have changed, but not the situation as a whole: to safely retrieve the page of a widget, you still have to iterate through the pages and find the page that references the annotation.

The safe method:

int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
    COSDictionary widgetObject = widget.getCOSObject();
    PDPageTree pages = document.getPages();
    for (int i = 0; i < pages.getCount(); i++)
    {
        for (PDAnnotation annotation : pages.get(i).getAnnotations())
        {
            COSDictionary annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject))
                return i;
        }
    }
    return -1;
}

The fast method:

int determineFast(PDDocument document, PDAnnotationWidget widget)
{
    PDPage page = widget.getPage();
    return page != null ? document.getPages().indexOf(page) : -1;
}

Usage:

PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
    for (PDField field : acroForm.getFieldTree())
    {
        System.out.println(field.getFullyQualifiedName());
        for (PDAnnotationWidget widget : field.getWidgets())
        {
            System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
            System.out.printf(" - fast: %s", determineFast(document, widget));
            System.out.printf(" - safe: %s\n", determineSafe(document, widget));
        }
    }
}

(DetermineWidgetPage.java)

(In contrast to the 1.8.x code, the safe method here simply searches for the page of a single widget. If your code has to determine the pages of many widgets, you should create a lookup Map like in the 1.8.x case.)
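As a rough illustration of such a lookup Map, here is a library-free sketch that builds the index in a single pass over the pages. It is not PDFBox code: AnnotationPageIndex, build and pageOf are hypothetical names, P stands in for a page type and A for an annotation object type. It also assumes that the very same in-memory object is reached from the page and from the field, which is why an identity-based map is used.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Library-free sketch: P stands in for a page type, A for an annotation object type.
final class AnnotationPageIndex {

    // Builds a map from annotation object to zero-based page index in one pass.
    // Keys are compared by identity (==), not by equals(); this is an assumption of the sketch.
    static <P, A> Map<A, Integer> build(List<P> pages, Function<P, List<A>> annotationsOf) {
        Map<A, Integer> pageIndexByAnnotation = new IdentityHashMap<>();
        for (int i = 0; i < pages.size(); i++) {
            for (A annotation : annotationsOf.apply(pages.get(i))) {
                // Keep the first page that references the annotation.
                pageIndexByAnnotation.putIfAbsent(annotation, i);
            }
        }
        return pageIndexByAnnotation;
    }

    // Returns the page index of the annotation, or -1 if no page references it.
    static <A> int pageOf(Map<A, Integer> index, A annotation) {
        Integer page = index.get(annotation);
        return page != null ? page : -1;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: each page is just the list of its annotation objects.
        Object w1 = new Object();
        Object w2 = new Object();
        Object orphan = new Object();
        List<List<Object>> pages = Arrays.asList(
                Collections.singletonList(w1),
                Collections.<Object>emptyList(),
                Collections.singletonList(w2));
        Map<Object, Integer> index = build(pages, page -> page);
        System.out.println(pageOf(index, w1));     // prints 0
        System.out.println(pageOf(index, w2));     // prints 2
        System.out.println(pageOf(index, orphan)); // prints -1
    }
}
```

With PDFBox 2.0.x, the keys would presumably be the getCOSObject() dictionaries of each page's annotations, as in determineSafe above, so that every widget lookup becomes a single map access instead of a scan over all pages.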

Example documents

A document for which the fast method fails: aFieldTwice.pdf

A document for which the fast method works: test_duplicate_field2.pdf
