How can I extract text from a PDF file in Perl?

You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From the CAM::PDF::PageText documentation on CPAN:

   use CAM::PDF;
   use CAM::PDF::PageText;

   # $filename is the path to the PDF file
   my $pdf = CAM::PDF->new($filename);
   my $pageone_tree = $pdf->getPageContentTree(1);
   print CAM::PDF::PageText->render($pageone_tree);

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields, etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
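The snippet above only dumps page 1. If you want every page, something like the following rough sketch might work. The helper name pdf_to_text and the script name pdf2text.pl are my own inventions, not part of CAM::PDF; it assumes the numPages and getPageContentTree methods behave as documented and that CAM::PDF->new returns false on failure and sets $CAM::PDF::errstr.

```perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Hypothetical helper (not part of CAM::PDF): returns the text of
# every page in the file, with pages separated by a blank line.
sub pdf_to_text {
    my ($filename) = @_;

    # Assumption: new() returns false on failure and sets $CAM::PDF::errstr.
    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename: $CAM::PDF::errstr\n";

    my @pages;
    for my $pagenum (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($pagenum);
        next unless $tree;    # skip pages whose content could not be parsed
        push @pages, CAM::PDF::PageText->render($tree);
    }
    return join "\n\n", @pages;
}

# Run as: perl pdf2text.pl some.pdf
print pdf_to_text($ARGV[0]), "\n" if @ARGV;
```

The same caveats apply to each page: the text order is a best guess, so expect odd results on anything with columns, rotated text or forms.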
