Simple html dom file_get_html not working – is there any workaround?

As I said, your example is working fine for me… But try it this way, using cURL instead:

    // base url
    $base = "https://play.google.com/store/apps";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $base);
    curl_setopt($curl, CURLOPT_REFERER, $base);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $str = curl_exec($curl);
    curl_close($curl);

    // Create a DOM object
    $html_base = new simple_html_dom();
    …

How can I remove attributes from an html tag?

Although there are better ways, you could strip attributes from HTML tags with a regular expression:

    <?php
    function stripArgumentFromTags( $htmlString ) {
        $regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag

        $chunks = preg_split($regEx, $htmlString, -1, PREG_SPLIT_DELIM_CAPTURE);
        $chunkCount = count($chunks);

        $strippedString = '';
        for ($n = 1; $n < $chunkCount; $n++) {
            $strippedString .= $chunks[$n];
            …
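
One of the "better ways" hinted at above is to let a parser find the tags instead of a regular expression. Here is a minimal sketch in Python using only the standard library's `html.parser`; the names `AttributeStripper` and `strip_attributes` are made up for this illustration and are not part of the original answer. It re-emits the markup it sees without attributes, so treat it as a starting point rather than a complete sanitizer (for example, tag names come back lowercased).

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit parsed markup, dropping every attribute from start tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>.
        self.parts.append(f"<{tag} />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html_string):
    """Return html_string with all attributes removed from its tags."""
    parser = AttributeStripper()
    parser.feed(html_string)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_attributes('<p class="a" id=\'b\'>Hi <a href="x">there</a></p>'))
```

Unlike the regex, this does not care how many attributes a tag has or how long their names are.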

how to use dom php parser

First, I have to tell you that you can’t use the same id on two different divs; that is what classes are for. Every id should be unique within the document. Code to get the contents of the div with id="interestingbox":

    $html = '
    <html>
    <head></head>
    <body>
    <div id="interestingbox">
        <div id="interestingdetails" class="txtnormal">
            <div>Content1</div>
            <div>Content2</div>
        </div>
    </div>
    <div id="interestingbox2"><a …

Parse the JavaScript returned from BeautifulSoup

Something like PhantomJS may be more robust, but here’s some basic Python 2 code to extract the full menu:

    import json
    import re
    import urllib2

    text = urllib2.urlopen('http://dcsd.nutrislice.com/menu/meadow-view/lunch/').read()
    menu = json.loads(re.search(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);", text).group(1))
    print menu

After that, you’ll want to search through the menu for the date you’re interested in. EDIT: Some overkill on my part: …
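
To make the two steps (pull the JSON out of the script, then search it for a date) concrete without hitting the network, here is a small Python 3 sketch. The `SAMPLE_PAGE` text and the `days` / `date` / `menu_items` keys are invented for the illustration; I don't know the real shape of the site's data, so adjust `find_day` to whatever the decoded JSON actually contains.

```python
import json
import re

# Hypothetical page source standing in for the downloaded HTML.
SAMPLE_PAGE = """
<script>
  var bootstrapData = {};
  bootstrapData['menuMonthWeeks'] = [{"days": [{"date": "2014-03-03", "menu_items": ["Pizza"]}]}];
  bootstrapData['other'] = 1;
</script>
"""

# Same pattern as in the answer: capture everything between "=" and the
# last ";" on the line that assigns bootstrapData['menuMonthWeeks'].
PATTERN = re.compile(r"bootstrapData\['menuMonthWeeks'\]\s*=\s*(.*);")


def extract_menu(page_text):
    """Return the decoded JSON assigned to bootstrapData['menuMonthWeeks']."""
    match = PATTERN.search(page_text)
    if match is None:
        raise ValueError("menuMonthWeeks assignment not found")
    return json.loads(match.group(1))


def find_day(menu, date):
    """Return the first day entry whose (assumed) 'date' key equals date."""
    for week in menu:
        for day in week.get("days", []):
            if day.get("date") == date:
                return day
    return None


if __name__ == "__main__":
    menu = extract_menu(SAMPLE_PAGE)
    print(find_day(menu, "2014-03-03"))
```

The regex relies on the whole assignment sitting on one line, because `.` does not match newlines by default.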

How to read HTML as XML?

HTML simply isn’t the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform the result into LINQ to XML, or process it directly.
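
The answer above is about .NET, but the idea carries over to any language: read the markup with a forgiving HTML parser, then hand the result to your XML tooling. Here is a rough Python sketch using only the standard library (`html.parser` feeding an `xml.etree.ElementTree.TreeBuilder`). The names `TreeBuildingParser` and `html_to_element` are invented for this example. It assumes the input has a single top-level element, and its recovery from unclosed tags is a simple heuristic, not the HTML5 parsing algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TreeBuildingParser(HTMLParser):
    """Feed HTML parser events into an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") become empty strings.
        attributes = {name: (value if value is not None else "") for name, value in attrs}
        self.builder.start(tag, attributes)
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)


def html_to_element(html_string):
    """Parse lenient HTML and return the root as an ElementTree Element."""
    parser = TreeBuildingParser()
    parser.feed(html_string)
    parser.close()
    # Close whatever the document never closed itself.
    while parser.open_tags:
        parser.builder.end(parser.open_tags.pop())
    return parser.builder.close()


if __name__ == "__main__":
    root = html_to_element("<html><body><p class=intro>Hello<br>world</p></body></html>")
    print(ET.tostring(root, encoding="unicode"))
```

Once you have the `Element`, the usual XML operations (`find`, `iter`, serialization) work on it even though the source had an unquoted attribute and an unclosed `<br>`.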

What is the best practice for parsing remote content with jQuery?

Instead of hacking jQuery to do this, I’d suggest you drop out of jQuery for a minute and use raw XML DOM methods. Using XML DOM methods you can do this:

    window.onload = function(){
        $.ajax({
            type: 'GET',
            url: 'text.html',
            dataType: 'html',
            success: function(data) {
                // cross-platform XML object creation from w3schools
                try // Internet Explorer
                …

Web scraping in PHP

I recommend you consider simple_html_dom for this. It will make it very easy. Here is a working example of how to pull the title and the first image:

    <?php
    require 'simple_html_dom.php';

    $html = file_get_html('http://www.google.com/');
    $title = $html->find('title', 0);
    $image = $html->find('img', 0);

    echo $title->plaintext."<br>\n";
    echo $image->src;
    ?>

Here is a second example that will do the …
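
For comparison, the same "title plus first image" extraction can be sketched in Python with just the standard library. This is not simple_html_dom; the class name `TitleAndFirstImage` and the sample markup are made up for the example, and it works on a string you have already downloaded.

```python
from html.parser import HTMLParser


class TitleAndFirstImage(HTMLParser):
    """Record the text of the first <title> and the src of the first <img>."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.first_image_src = None
        self._in_title = False
        self._title_parts = []
        self._image_seen = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "img" and not self._image_seen:
            self._image_seen = True
            # dict(attrs).get returns None when the image has no src.
            self.first_image_src = dict(attrs).get("src")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


if __name__ == "__main__":
    # Hypothetical page content used only for this demonstration.
    sample = (
        "<html><head><title> Example page </title></head>"
        '<body><img src="/logo.png" alt="logo"><img src="/other.png"></body></html>'
    )
    parser = TitleAndFirstImage()
    parser.feed(sample)
    parser.close()
    print(parser.title)
    print(parser.first_image_src)
```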

How to change tag name with BeautifulSoup?

I don’t know how you’re accessing the tag, but the following works for me:

    import BeautifulSoup

    if __name__ == "__main__":
        data = """
        <html>
        <h2 class="someclass">some title</h2>
        <ul>
           <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
           <li>Aliquam tincidunt mauris eu risus.</li>
           <li>Vestibulum auctor dapibus neque.</li>
        </ul>
        </html>
        """
        soup = BeautifulSoup.BeautifulSoup(data)
        h2 = soup.find('h2')
        h2.name = "h1"
        print …

php regex to get string inside href tag

Don’t use regex for this. You can use XPath and built-in PHP functions to get what you want:

    $xml = simplexml_load_string($myHtml);
    $list = $xml->xpath("//@href");
    $preparedUrls = array();
    foreach ($list as $item) {
        $item = parse_url($item);
        $preparedUrls[] = $item['scheme'] . '://' . $item['host'] . "/";
    }
    print_r($preparedUrls);
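
The same approach (parse the markup, read the `href` attributes, then split each URL) can be sketched in Python with the standard library. `HrefCollector` and `base_urls` are names invented for this example. Two choices here are my own assumptions rather than part of the PHP answer: links without a scheme and host (relative links) are skipped, and `netloc` is used for the host part, which also keeps a port if one is present.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collect the value of every non-empty href attribute, on any tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def base_urls(html_string):
    """Return 'scheme://host/' for each absolute href found in html_string."""
    collector = HrefCollector()
    collector.feed(html_string)
    collector.close()

    prepared = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        # Relative links have no scheme or host, so there is nothing to build.
        if parts.scheme and parts.netloc:
            prepared.append(f"{parts.scheme}://{parts.netloc}/")
    return prepared


if __name__ == "__main__":
    sample = '<a href="http://example.com/a/b?x=1">A</a> <a href="/relative">B</a>'
    print(base_urls(sample))
```

Because `html.parser` tolerates sloppy markup, this also works on pages that are not well-formed XML.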