Read Tables from pdf using python [duplicate]

Your document is encrypted. Have a look at the PDF trailer:

trailer
<< /Root 2 0 R
   /Info 1 0 R
   /ID [<BC5D1FCFDAF3326F2552B3182CCF1E18> <BC5D1FCFDAF3326F2552B3182CCF1E18>]
   /Encrypt 36 0 R
   /Size 37
>>

The /Encrypt entry refers to object number 36, generation 0. Let’s use pdfreader to dive deeper:

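from pdfreader import PDFDocument

# Open the file in binary mode and close it automatically when done
with open("10027183.pdf", "rb") as fd:
    doc = PDFDocument(fd)
    # Fetch indirect object 36, generation 0 (the encryption dictionary)
    obj = doc.locate_object(36, 0)
    print(obj)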

You will see:

{'Filter': 'Standard', 
 'V': 2, 
 'R': 3, 
 'Length': 128, 
 'P': -3897, 
 'O': '36451BD39D753B7C1D10922C28E6665AA4F3353FB0348B536893E3B1DB5C579B', 
 'U': '7AFCC66F84741480C7129FC777BB1CDE28BF4E5E4E758A4164004E56FFFA0108'}
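
As a side note, the /P entry in this dictionary is a signed 32-bit bitmask of user permissions. The sketch below decodes it with the bit positions from the standard security handler's permission table; the function name and the dictionary keys are my own labels, not part of pdfreader or the spec. For -3897 it reports only low-resolution printing as allowed, while copying or extracting content is not.

```python
# Permission bits of the standard security handler (positions are 1-based).
# The names on the right are illustrative labels chosen for this sketch.
PERMISSION_BITS = {
    3: "print",
    4: "modify",
    5: "copy_or_extract",
    6: "annotate",
    9: "fill_forms",
    10: "extract_for_accessibility",
    11: "assemble",
    12: "print_high_quality",
}


def decode_permissions(p: int) -> dict:
    """Map a signed /P value to {permission label: allowed?}."""
    flags = p & 0xFFFFFFFF  # reinterpret the signed value as unsigned 32-bit
    return {
        name: bool((flags >> (bit - 1)) & 1)
        for bit, name in PERMISSION_BITS.items()
    }


perms = decode_permissions(-3897)
for name, allowed in perms.items():
    print(f"{name}: {allowed}")
```

These flags are only honoured by conforming readers; they do not change the fact that the content itself is encrypted.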

A value of V=2 stands for the RC4 or AES algorithms with encryption key lengths greater than 40 bits; here /Length is 128, so the key is 128 bits long. In your case the user password is simply empty, which is why Adobe Reader doesn’t ask for one. Nevertheless, all the data is still encrypted.
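
To make the "empty password" claim checkable, here is a minimal sketch of the revision 3 user-password check from the standard security handler, using only the Python standard library. It derives the file key from the password, /O, /P and the first /ID element, recomputes the first 16 bytes of /U, and compares. The helper names (`rc4`, `file_key`, `user_entry`, `check_user_password`) are mine, and the sketch assumes RC4 with R=3; it does not cover other revisions. If the hex values shown above are transcribed faithfully, the check is expected to succeed for an empty password, but I have not asserted that here.

```python
import hashlib

# 32-byte padding string defined by the standard security handler.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation and XOR
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def file_key(password: bytes, o: bytes, p: int, doc_id: bytes, length_bits: int) -> bytes:
    """Derive the file encryption key (revision 3)."""
    n = length_bits // 8
    padded = (password + PAD)[:32]  # pad or truncate to 32 bytes
    p_bytes = (p & 0xFFFFFFFF).to_bytes(4, "little")
    digest = hashlib.md5(padded + o + p_bytes + doc_id).digest()
    for _ in range(50):  # revision 3 re-hashes the first n bytes 50 times
        digest = hashlib.md5(digest[:n]).digest()
    return digest[:n]


def user_entry(key: bytes, doc_id: bytes) -> bytes:
    """Compute the first 16 bytes of /U (revision 3)."""
    value = rc4(key, hashlib.md5(PAD + doc_id).digest())
    for i in range(1, 20):  # 19 extra rounds, key XORed with the counter
        value = rc4(bytes(b ^ i for b in key), value)
    return value


def check_user_password(password: bytes, o_hex: str, u_hex: str, p: int,
                        id_hex: str, length_bits: int = 128) -> bool:
    doc_id = bytes.fromhex(id_hex)
    key = file_key(password, bytes.fromhex(o_hex), p, doc_id, length_bits)
    # For revision 3 only the first 16 bytes of /U are significant.
    return user_entry(key, doc_id) == bytes.fromhex(u_hex)[:16]


# Values copied from the trailer and encryption dictionary shown above.
ok = check_user_password(
    b"",
    "36451BD39D753B7C1D10922C28E6665AA4F3353FB0348B536893E3B1DB5C579B",
    "7AFCC66F84741480C7129FC777BB1CDE28BF4E5E4E758A4164004E56FFFA0108",
    -3897,
    "BC5D1FCFDAF3326F2552B3182CCF1E18",
)
print("empty user password accepted:", ok)
```

Note that the last 16 bytes of /U here are just the start of the padding string, which is consistent with a revision 3 handler filling the unused half with arbitrary bytes.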

According to the PDF spec, “Encryption applies to all strings and streams …”, with a few exceptions. This means you need to decrypt all streams and strings before extracting any data.
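
What "decrypt all streams and strings" means in practice can be sketched as follows, assuming RC4 (the usual choice with V=2). Each indirect object gets its own key: the file key is extended with the low 3 bytes of the object number and the low 2 bytes of the generation number, hashed with MD5, and truncated. The helper names are mine, and `demo_key` is a placeholder rather than the real key of your document; in a real run it would be the file key derived from the password.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; the same call encrypts and decrypts."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def object_key(file_key: bytes, obj_num: int, gen: int) -> bytes:
    """Per-object RC4 key: MD5(file key + object number + generation)."""
    material = (
        file_key
        + obj_num.to_bytes(4, "little")[:3]  # low-order 3 bytes
        + gen.to_bytes(2, "little")          # low-order 2 bytes
    )
    # Use the first (n + 5) bytes of the hash, capped at 16.
    return hashlib.md5(material).digest()[: min(len(file_key) + 5, 16)]


def crypt_object_data(file_key: bytes, obj_num: int, gen: int, data: bytes) -> bytes:
    """Encrypt or decrypt a string/stream belonging to object (obj_num, gen)."""
    return rc4(object_key(file_key, obj_num, gen), data)


# Placeholder key for illustration only, NOT the key of the actual file.
demo_key = bytes(range(16))
ciphertext = crypt_object_data(demo_key, 12, 0, b"Table cell")
plaintext = crypt_object_data(demo_key, 12, 0, ciphertext)
print(plaintext)
```

Because the key depends on the object number and generation, the same bytes encrypted inside two different objects produce different ciphertexts, so every string and stream has to be decrypted in the context of its own object.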
