Read a .doc file with Python

One option is the textract library.
It takes care of both .doc and .docx files.

import textract
# process() returns bytes; decode to get a str
text = textract.process("path/to/file.extension").decode("utf-8")
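
As a minimal sketch of how the call above might be wrapped for reuse: the helper name read_word_file and the extension check are illustrative assumptions, not part of textract, and the decode step assumes textract's default UTF-8 output.

```python
import os


def read_word_file(path):
    """Return the text of a .doc or .docx file as a str (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".doc", ".docx"):
        raise ValueError(f"expected a .doc or .docx file, got {ext!r}")
    # Imported here so the extension check runs even if textract is missing.
    import textract
    # process() returns bytes; decode to get a str (assumes UTF-8 output).
    return textract.process(path).decode("utf-8")
```

Calling read_word_file("path/to/file.doc") should then give the document text as a regular string.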

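You can also use antiword directly (sudo apt-get install antiword). Note that antiword writes the document's plain text to standard output; it does not produce a docx file, so there is no need to go through docx2txt afterwards. Redirect the output to a text file instead:

antiword filename.doc > filename.txt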

Ultimately, textract itself uses antiword in the backend for .doc files.
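
If you would rather call antiword yourself than go through textract, something like the following should work. This is only a sketch: the function name doc_to_text and its antiword_bin parameter are made up for illustration, and it assumes antiword is installed and on your PATH.

```python
import shutil
import subprocess


def doc_to_text(path, antiword_bin="antiword"):
    """Return the text of a .doc file by running antiword on it (hypothetical helper)."""
    # Fail early with a clear message if the binary is not on PATH.
    if shutil.which(antiword_bin) is None:
        raise FileNotFoundError(f"{antiword_bin} not found on PATH")
    # antiword prints the extracted text to standard output.
    result = subprocess.run(
        [antiword_bin, path],
        capture_output=True,
        check=True,  # raise CalledProcessError if antiword exits with an error
    )
    # Decode leniently, since the output encoding can depend on the system setup.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would then be text = doc_to_text("filename.doc"), with no intermediate file needed.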
