How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

You’ll need to reverse-engineer what the JavaScript is doing. Does it fire off an AJAX request to populate the <div>? If so, it should be pretty easy to sniff the request using Firebug and then duplicate it with LWP::UserAgent or WWW::Mechanize to get the information. If the JavaScript is just doing pure DOM manipulation, then …
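The “sniff it, then replicate it” idea above is language-agnostic; here it is sketched in Python’s standard library rather than Perl’s LWP::UserAgent, with a hypothetical endpoint and parameters standing in for whatever the network inspector actually shows:

```python
# Sketch of replicating a page's XHR outside the browser. The endpoint,
# query parameters, and referer below are hypothetical placeholders for
# whatever you observe in Firebug / the browser's network tab.
import urllib.parse
import urllib.request

def build_ajax_request(endpoint, params, referer):
    """Recreate the request the page's JavaScript fires."""
    query = urllib.parse.urlencode(params)
    req = urllib.request.Request(endpoint + "?" + query)
    # Some endpoints check these headers before answering.
    req.add_header("X-Requested-With", "XMLHttpRequest")
    req.add_header("Referer", referer)
    return req

req = build_ajax_request(
    "https://example.com/api/content",   # hypothetical endpoint
    {"page": 1, "section": "news"},
    "https://example.com/news",
)
print(req.full_url)  # → https://example.com/api/content?page=1&section=news
```

Once the request object matches what the browser sends, `urllib.request.urlopen(req)` (or the Perl equivalents named above) returns the same payload the page’s JavaScript receives.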

Text Extraction from HTML in Java

Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code: Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); Elements ps = doc.select("p"); Then write it out to a file in one more line: out.write(ps.text()); // it will append all of the p elements together in one long string …
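For comparison with the jsoup two-liner (which is Java), the same “collect the text of every <p>” extraction can be done with nothing but Python’s standard-library parser; the sample markup here is made up:

```python
# Gather the text content of every <p> element using only the stdlib.
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Accumulate the text of each <p> element as it is parsed."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1] += data

collector = ParagraphCollector()
collector.feed("<html><body><p>First.</p><div><p>Second.</p></div></body></html>")
print(" ".join(collector.paragraphs))  # → First. Second.
```

Joining the list mirrors jsoup’s `ps.text()`, which likewise concatenates all paragraph text into one string.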

Scraping content from multiple pages of a website using BeautifulSoup and Selenium

If you want to get the last page number of the above link for proceeding, which is 499, you can use either Selenium or BeautifulSoup as follows. Selenium: from selenium import webdriver driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe') url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061" driver.get(url) element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]") driver.execute_script("return arguments[0].scrollIntoView(true);", element) print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML")) …
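The “last `<li>/<a>` inside the pages list” step doesn’t need a live browser once you have the HTML; here is the same lookup on a simplified, made-up stand-in for the site’s pagination markup, using only the standard library:

```python
# Find the text of the last <a> inside ul.pages — the last page number.
from html.parser import HTMLParser

class LastPageFinder(HTMLParser):
    """Remember the text of the most recent <a> seen inside ul.pages."""
    def __init__(self):
        super().__init__()
        self.in_pages = False
        self.in_link = False
        self.last_page = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and "pages" in (attrs.get("class") or ""):
            self.in_pages = True
        elif tag == "a" and self.in_pages:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_pages = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.last_page = data.strip()

# Simplified stand-in for the real pagination block.
snippet = """
<ul class="pagination table">
  <li><ul class="pages table">
    <li><a>1</a></li><li><a>2</a></li><li><a>499</a></li>
  </ul></li>
</ul>
"""
finder = LastPageFinder()
finder.feed(snippet)
print(finder.last_page)  # → 499
```

With BeautifulSoup the same thing is `soup.select("ul.pages li a")[-1].text`; Selenium is only needed when the pagination is injected by JavaScript.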

Looping over URLs to do the same thing

PhantomJS is asynchronous. By calling page.open() multiple times in a loop, you essentially rush the execution of the callbacks: you overwrite the current request before it is finished with a new request, which is then again overwritten. You need to execute them one after the other, for example like this: page.open(url, function () { waitFor(function() …
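The fix being described, only start the next request when the previous one’s callback has finished, can be seen in miniature with Python’s asyncio; `fetch_page` below is a stand-in for `page.open` and its callback, and the log makes the ordering observable:

```python
# Serializing asynchronous fetches: each one completes before the next starts.
import asyncio

async def fetch_page(url, log):
    """Pretend to open a page, recording start/finish so order is visible."""
    log.append(("start", url))
    await asyncio.sleep(0)      # yield control, like a network round-trip
    log.append(("done", url))

async def fetch_all_sequentially(urls):
    log = []
    for url in urls:
        # Awaiting inside the loop is the key point: the next request does
        # not begin until the previous one has fully finished — unlike
        # firing page.open() for every URL in one pass of the loop.
        await fetch_page(url, log)
    return log

log = asyncio.run(fetch_all_sequentially(["a", "b"]))
print(log)  # → [('start', 'a'), ('done', 'a'), ('start', 'b'), ('done', 'b')]
```

In PhantomJS, which has no `await`, the same serialization is achieved by recursing from inside the callback, as the answer’s `page.open(url, function () { … })` pattern goes on to show.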

Using Python and Mechanize to submit form data and authenticate

I would definitely suggest trying to use the API if possible, but this works for me (not for your example post, which has been deleted, but for any active one): #!/usr/bin/env python import mechanize import cookielib import urllib import logging import sys def main(): br = mechanize.Browser() cj = cookielib.LWPCookieJar() br.set_cookiejar(cj) br.set_handle_equiv(True) br.set_handle_gzip(True) br.set_handle_redirect(True) br.set_handle_referer(True) …
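The excerpt stops mid-setup; the part it is heading toward is filling in and POSTing the login form with the cookie jar attached. Since `mechanize`/`cookielib` as written are Python 2, here is that step sketched with Python 3’s standard library, with hypothetical form field names (check the page’s `<form>` for the real ones):

```python
# Cookie-aware login sketch: a cookie jar attached to an opener, plus a
# form-encoded POST. Field names and URL are hypothetical placeholders.
import http.cookiejar
import urllib.parse
import urllib.request

cj = http.cookiejar.LWPCookieJar()              # same role as cookielib.LWPCookieJar
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; scraper)")]

def build_login_request(login_url, username, password):
    """Encode the login form exactly as a browser would POST it."""
    form = urllib.parse.urlencode({
        "username": username,   # hypothetical field name
        "password": password,   # hypothetical field name
    }).encode("ascii")
    return urllib.request.Request(login_url, data=form, method="POST")

req = build_login_request("https://example.com/login", "alice", "s3cret")
print(req.get_method(), req.data)
```

After `opener.open(req)` succeeds, the session cookies live in `cj`, so subsequent `opener.open(...)` calls are authenticated, which is exactly what mechanize’s Browser does behind the scenes.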

Nokogiri, open-uri, and Unicode Characters

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(…).read and pass the resulting string to Nokogiri. Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. “Genealogía de Jesucristo”. But even with a magic comment in the Ruby file and setting …
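The symptom here, correct UTF-8 bytes turning into mojibake, is purely a decoding mismatch, which a few lines of Python (used only for illustration; the answer itself is about Ruby and Nokogiri) make concrete:

```python
# The same bytes, decoded two ways: the charset you use to interpret the
# bytes, not the bytes themselves, determines whether you see mojibake.
raw = "Genealogía de Jesucristo".encode("utf-8")

right = raw.decode("utf-8")    # encoding honoured — what open(…).read
                               # hands Nokogiri in the fix above
wrong = raw.decode("latin-1")  # same bytes misread: "í" becomes "Ã"
                               # plus a stray soft hyphen

print(right)           # → Genealogía de Jesucristo
print(wrong == right)  # → False
```

This is why the fix is to read the document into a string whose encoding is already correct and hand that to Nokogiri, instead of letting a stream with a mislabelled charset be decoded on the way in.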