extract - w3toppers.com

How to retrieve comments from within an XML Document in PHP

SimpleXML cannot handle comments, but the DOM extension can. Here’s how you can extract all the comments. You just have to adapt the XPath expression to target the node you want.

    $doc = new DOMDocument;
    $doc->loadXML(
        '<doc>
            <node><!-- First node --></node>
            <node><!-- Second node --></node>
        </doc>'
    );
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//comment()') as $comment) …

Javascript: extract URLs from string (inc. querystring) and return array

I just use URI.js, which makes it easy.

    var source = "Hello www.example.com,\n"
        + "http://google.com is a search engine, like http://www.bing.com\n"
        + "http://exämple.org/foo.html?baz=la#bumm is an IDN URL,\n"
        + "http://123.123.123.123/foo.html is IPv4 and "
        + "http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html is IPv6.\n"
        + "links can also be in parens (http://example.org) "
        + "or quotes »http://example.org«.";
    var result = URI.withinString(source, function(url) …

Extracting image from PDF with /CCITTFaxDecode filter

Actually, vbcrlfuser’s answer did help me, but the code was not quite correct for the version of BitMiracle.LibTiff.NET that I was able to download. In the current version, equivalent code looks like this:

    using iTextSharp.text.pdf;
    using BitMiracle.LibTiff.Classic;
    …
    Tiff tiff = Tiff.Open("C:\\test.tif", "w");
    tiff.SetField(TiffTag.IMAGEWIDTH, UInt32.Parse(pd.Get(PdfName.WIDTH).ToString()));
    tiff.SetField(TiffTag.IMAGELENGTH, UInt32.Parse(pd.Get(PdfName.HEIGHT).ToString()));
    tiff.SetField(TiffTag.COMPRESSION, Compression.CCITTFAX4);
    tiff.SetField(TiffTag.BITSPERSAMPLE, UInt32.Parse(pd.Get(PdfName.BITSPERCOMPONENT).ToString()));
    tiff.SetField(TiffTag.SAMPLESPERPIXEL, 1);
    tiff.WriteRawStrip(0, raw, …

PHP String Manipulation: Extract hrefs

You can use PHP's DOMDocument library to parse XML and/or HTML. Something like the following should do the trick to get the href attributes from a string of HTML. The HTML is wrapped in single quotes so that the double-quoted attributes inside it do not end the PHP string.

    $html = '<h1>Doctors</h1>
    <a title="C – G" href="https://stackoverflow.com/questions/4702987/linkl.html">C – G</a>
    <a title="G – K" href="link2.html">G – K</a>
    <a title="K – M" href="link3.html">K – M</a>';
    $hrefs = array();
    $dom …

Calculating frequency of each word in a sentence in java

Use a map with the word as the key and the count as the value, something like this:

    Map<String, Integer> map = new HashMap<>();
    for (String w : words) {
        Integer n = map.get(w);
        n = (n == null) ? 1 : ++n;
        map.put(w, n);
    }

If you are not allowed to use java.util then you can sort …
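The same map-based counting can be sketched in Python, where collections.Counter plays the role of the map. This is only an illustration of the idea: the function name word_frequencies is hypothetical, and lowercasing plus splitting on whitespace are assumptions about how a sentence should be broken into words.

```python
from collections import Counter


def word_frequencies(sentence):
    """Return a mapping of word -> number of occurrences in the sentence."""
    # Assumption: words are separated by whitespace and case is ignored.
    words = sentence.lower().split()
    # Counter increments the stored count for each word it sees,
    # just like the get/put loop over a HashMap.
    return Counter(words)


if __name__ == "__main__":
    for word, count in word_frequencies("the cat and the hat").items():
        print(word, count)
```

Punctuation is not stripped here, so "hat" and "hat." would be counted as different words.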

Extract string before “|” [duplicate]

Print lines in one file matching patterns in another file

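Try

    grep -Fwf file2 file1 > out

The -f file2 option reads the patterns from file2, one per line, and -w only accepts matches that form whole words. The -F option specifies plain string matching, so it should be faster because the regex engine does not have to be engaged.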
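For readers who want the same filtering outside the shell, here is a rough Python approximation of what the three flags do together. It is a sketch, not a reimplementation of grep: the function name grep_fixed_words is hypothetical, "word characters" are taken to be whatever Python's \w matches, and empty patterns are ignored (an assumption made to keep the example simple).

```python
import re


def grep_fixed_words(patterns, lines):
    """Return the lines that contain any pattern as a literal whole word."""
    # Assumption: blank patterns are dropped rather than matching everything.
    patterns = [p for p in patterns if p]
    if not patterns:
        return []
    # re.escape makes every pattern a plain string, like -F.
    alternation = "|".join(re.escape(p) for p in patterns)
    # The lookarounds require a non-word character (or the line edge)
    # on both sides of the match, like -w.
    matcher = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return [line for line in lines if matcher.search(line)]


if __name__ == "__main__":
    # file1 and file2 are the file names used in the command above.
    with open("file2") as f:
        wanted = f.read().splitlines()
    with open("file1") as f:
        for line in grep_fixed_words(wanted, f.read().splitlines()):
            print(line)
```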

Extract files from zip without keeping the structure using python ZipFile?

This opens a file handle for each member of the zip archive, takes the base filename, and copies the contents to a target file (that's how ZipFile.extract works, without taking care of subdirectories).

    import os
    import shutil
    import zipfile

    my_dir = r"D:\Download"
    my_zip = r"D:\Download\my_file.zip"

    with zipfile.ZipFile(my_zip) as zip_file:
        for member in zip_file.namelist():
            filename = os.path.basename(member)
            # skip directories
            …
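Because the excerpt above is cut off, here is a self-contained sketch of the flattening approach it describes. It is one possible completion, not necessarily the original answer's code: the function name extract_flat and its return value are hypothetical, and the loop body after the directory check is an assumption based on the description (open the member, copy it to a file named after its basename).

```python
import os
import shutil
import zipfile


def extract_flat(zip_path, target_dir):
    """Copy every file in the archive into target_dir, dropping subdirectories."""
    os.makedirs(target_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zip_file:
        for member in zip_file.namelist():
            filename = os.path.basename(member)
            # Directory entries end with "/" and have an empty basename; skip them.
            if not filename:
                continue
            target_path = os.path.join(target_dir, filename)
            # Stream the member's bytes straight into the target file.
            with zip_file.open(member) as source, open(target_path, "wb") as target:
                shutil.copyfileobj(source, target)
            written.append(target_path)
    return written
```

Note that two members with the same base name in different folders end up at the same target path, so the later one overwrites the earlier one.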

How do you extract a url from a string using python?

There may be a few ways to do this, but the cleanest would be to use a regex:

    >>> import re
    >>> myString = "This is a link http://www.google.com"
    >>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
    http://www.google.com

If there can be multiple links, you can use something similar to the following:

    >>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
    >>> print(re.findall(r'(https?://[^\s]+)', …
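The two calls can be wrapped into small helpers, as in the sketch below. The pattern is the one from the answer; the helper names first_url and find_urls are hypothetical, and returning None when nothing matches is a choice made here so that the search case does not raise an AttributeError on text without links.

```python
import re

# Same idea as the pattern above: "http://" or "https://" followed by
# everything up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://[^\s]+")


def first_url(text):
    """Return the first URL in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


def find_urls(text):
    """Return every URL in text, in order of appearance."""
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    print(first_url("This is a link http://www.google.com"))
    print(find_urls("Links: http://www.google.com and https://example.org/page"))
```

Since the pattern stops only at whitespace, punctuation that directly follows a URL (a trailing comma or closing parenthesis, for instance) is captured as part of it.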

How to extract just plain text from .doc & .docx files? [closed]

If you want the pure plain text (my requirement), then all you need is

    unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

which I found at command line fu. It unzips the docx file and gets the actual document, then strips all the XML tags. Obviously all formatting is lost.
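The same two steps (read word/document.xml out of the archive, then strip tags and non-printable characters) can be sketched in Python using only the standard library. The function name docx_to_text is hypothetical, UTF-8 encoding of the XML is assumed, and str.isprintable only approximates sed's [:print:] class.

```python
import re
import zipfile


def docx_to_text(path):
    """Return the text of a .docx file with all XML tags removed."""
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Drop every XML tag, like the first sed expression.
    text = re.sub(r"<[^>]+>", "", xml)
    # Drop non-printable characters (including newlines), like the second one.
    return "".join(ch for ch in text if ch.isprintable())
```

As with the shell pipeline, paragraphs run together and XML entities such as &amp; are left undecoded.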