How can I extract tables from PDF documents?

The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus your task involves putting our human table recognition capabilities into code which is quite a task.

Generally speaking, if you are sure enough future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for some easy to follow hints to recognize the contents of individual fields.

Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you’ll get (at least I do) semi-random characters from the WinAnsi range.

This is due to the fact that all fonts in the document claim that they use WinAnsiEncoding even though the characters referenced this way definitively are not from the WinAnsi character selection.

Thus reliable text extraction from your document without OCR is impossible after all!

(Trying copy&paste from Adobe Reader generally is a good first test whether text extraction is feasible at all; the text extraction methods of the Reader have been developed for many many years and, therefore, have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)

Leave a Comment