How to extract just plain text from .doc & .docx files? [closed]
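If you want pure plain text (which was my requirement), then all you need is:

    unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

I found this at commandlinefu. It unzips the .docx file, writes the main document part (word/document.xml) to standard output, and then strips all the XML tags and any non-printable characters. Obviously all formatting is lost, and because the tags are removed without anything being put in their place, paragraphs tend to run together. Note that this only handles .docx; a legacy .doc file is not a zip archive, so unzip cannot read it.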
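If you need paragraph breaks to survive, the same idea (open the zip, read word/document.xml, keep only the text) can be sketched in Python using just the standard library. This is a minimal sketch, not a full converter: the function name `docx_to_text` is made up for illustration, and it assumes the body text lives in `w:t` elements inside `w:p` paragraphs, which is how WordprocessingML normally stores it. Headers, footers, footnotes and comments live in other parts of the archive and are ignored here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    (Hypothetical helper; a sketch, not a complete converter.)
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        pieces = []
        for node in para.iter():
            if node.tag == W_NS + "t" and node.text:
                pieces.append(node.text)   # a run of literal text
            elif node.tag == W_NS + "tab":
                pieces.append("\t")        # explicit tab character
            elif node.tag in (W_NS + "br", W_NS + "cr"):
                pieces.append("\n")        # manual line break
        paragraphs.append("".join(pieces))

    # Caveat: paragraphs nested inside another paragraph (e.g. text
    # boxes) are visited twice, so their text would be repeated.
    return "\n".join(paragraphs)
```

Usage would be something like `print(docx_to_text("some.docx"))`. As with the one-liner, all formatting is dropped; the only difference is that each paragraph ends up on its own line.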