C# Extract text from PDF using PdfSharp

Took Sergio’s answer and made some extension methods. I also changed the accumulation of strings into an iterator.

    public static class PdfSharpExtensions
    {
        public static IEnumerable<string> ExtractText(this PdfPage page)
        {
            var content = ContentReader.ReadContent(page);
            var text = content.ExtractText();
            return text;
        }

        public static IEnumerable<string> ExtractText(this CObject cObject)
        {
            if (cObject is COperator)
            {
                var cOperator …
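The excerpt above is cut off, so here is only a rough Python sketch of the general idea it mentions: walking a nested content tree and yielding strings lazily instead of accumulating them in a list. The `Text` and `Group` classes are hypothetical stand-ins, not PdfSharp types.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Text:
    """Hypothetical leaf node holding one decoded string."""
    value: str


@dataclass
class Group:
    """Hypothetical container node holding child nodes."""
    children: List[Union["Group", Text]] = field(default_factory=list)


def extract_accumulating(node) -> List[str]:
    """Accumulating style: build the whole list before returning it."""
    if isinstance(node, Text):
        return [node.value]
    result = []
    for child in node.children:
        result.extend(extract_accumulating(child))
    return result


def extract_lazily(node) -> Iterator[str]:
    """Iterator style: yield each string as soon as it is reached."""
    if isinstance(node, Text):
        yield node.value
        return
    for child in node.children:
        yield from extract_lazily(child)


tree = Group([Text("Hello"), Group([Text("PDF"), Text("world")])])

# Only the first leaf is visited here; the rest of the tree is not walked.
first = next(extract_lazily(tree))
```

Both functions produce the same strings in the same order. The lazy version lets the caller stop early or stream a large document without holding every string in memory.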

How can I read pdf in python? [duplicate]

You can use the PyPDF2 package.

    # install PyPDF2
    pip install PyPDF2

Once you have it installed:

    # import the required module
    import PyPDF2

    # create a PDF reader object
    reader = PyPDF2.PdfReader('example.pdf')

    # print the number of pages in the PDF file
    print(len(reader.pages))

    # print the text of the first page
    print(reader.pages[0].extract_text())

See the PyPDF2 documentation for more.
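To read every page rather than just the first, something along these lines should work. It is a minimal sketch: `iter_page_text` is a made-up helper name, and it only assumes the reader exposes `.pages` and each page has `extract_text()`, as in the snippet above.

```python
from typing import Iterator


def iter_page_text(reader) -> Iterator[str]:
    """Yield the extracted text of every page, in page order.

    `reader` is assumed to be a PyPDF2.PdfReader; any object with a
    `.pages` sequence whose items have `extract_text()` will do.
    """
    for page in reader.pages:
        # Pages without a text layer (e.g. scans) yield an empty string.
        yield page.extract_text() or ""
```

With PyPDF2 installed, usage would look like `"\n".join(iter_page_text(PyPDF2.PdfReader("example.pdf")))`.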

Extract floating point numbers from a delimited string in PHP

    $str = "152.15 x 12.34 x 11mm";
    preg_match_all('!\d+(?:\.\d+)?!', $str, $matches);
    $floats = array_map('floatval', $matches[0]);
    print_r($floats);

The (?:…) regular expression construct is called a non-capturing group. It means that chunk isn’t returned separately as part of the $matches array. This isn’t strictly necessary in this case, but it is a useful construct to know.

Note: calling …
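For comparison, here is the same pattern in Python, as a sketch rather than a port of the PHP answer. One difference worth knowing: with Python's `re.findall` the non-capturing group does matter, because a capturing group changes what `findall` returns.

```python
import re

text = "152.15 x 12.34 x 11mm"

# Non-capturing group: findall returns each whole match.
whole = re.findall(r"\d+(?:\.\d+)?", text)

# Capturing group: findall returns only what the group captured,
# and an empty string where the optional group did not take part.
captured = re.findall(r"\d+(\.\d+)?", text)

# Convert the whole matches to floats, like array_map('floatval', ...).
floats = [float(m) for m in whole]

print(whole)     # ['152.15', '12.34', '11']
print(captured)  # ['.15', '.34', '']
print(floats)    # [152.15, 12.34, 11.0]
```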

Text Extraction from HTML Java

jsoup

Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements ps = doc.select("p");

Then write it out to a file in one more line:

    out.write(ps.text()); // appends all of the p elements together in one long string

…
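If you want the same "collect all <p> text" idea without jsoup, here is a rough Python sketch using only the standard library's `html.parser`. This is a different technique from jsoup's CSS selectors, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element (nested tags are flattened)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._parts = None    # text pieces of the <p> currently open, if any

    def _flush(self):
        if self._parts is not None:
            self.paragraphs.append("".join(self._parts).strip())
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()     # tolerate a previous <p> that was never closed
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()


def paragraph_text(html: str) -> str:
    """Return the text of all <p> elements joined by single spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser._flush()           # pick up a trailing unclosed <p>
    return " ".join(parser.paragraphs)


sample = "<html><body><h1>Title</h1><p>First <b>bold</b> one.</p><p>Second.</p></body></html>"
joined = paragraph_text(sample)
print(joined)  # First bold one. Second.
```

It handles far fewer edge cases than a real HTML library, so treat it as a starting point only.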

Extract all email addresses from bulk text using jquery

Here’s how you can approach this:

HTML

    <p id="emails"></p>

JavaScript

    var text = '[email protected], "assdsdf" <[email protected]>, "rodnsdfald ferdfnson" <[email protected]>, "Affdmdol Gondfgale" <[email protected]>, "truform techno" <[email protected]>, "NiTsdfeSh ThIdfsKaRe" <[email protected]>, "akasdfsh kasdfstla" <[email protected]>, "Bisdsdfamal Prakaasdsh" <[email protected]>,; "milisdfsfnd ansdfasdfnsftwar" <[email protected]> datum eternus [email protected]';

    function extractEmails(text) {
        return text.match(/([a-zA-Z0-9._+-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi);
    }

    $("#emails").text(extractEmails(text).join('\n'));

Result

    [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]

Source: Extract email from bulk text (with …
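The same regular expression carries over to Python more or less unchanged. A minimal sketch follows; the sample addresses are invented for illustration and use reserved example domains.

```python
import re

# Same character classes as the JavaScript pattern above.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+")


def extract_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_RE.findall(text)


# Hypothetical input mixing bare addresses, display names and plain words.
sample = (
    'alice@example.com, "Bob B" <bob.b@example.org>, '
    "plain words carol+news@mail.example.net"
)

emails = extract_emails(sample)
print("\n".join(emails))
```

This pattern is deliberately loose. For example, an address followed directly by a full stop will have the stop included in the match, so strip trailing punctuation if that matters for your data.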