How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

You’ll need to reverse-engineer what the JavaScript is doing. Does it fire off an AJAX request to populate the <div>? If so, it should be pretty easy to sniff the request using Firebug and then duplicate it with LWP::UserAgent or WWW::Mechanize to get the information. If the JavaScript is just doing pure DOM manipulation, then …
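The “sniff it, then replicate it” idea above is language-agnostic; here it is sketched in Python’s standard library rather than Perl’s LWP::UserAgent, with a hypothetical endpoint and parameters standing in for whatever the network inspector actually shows:

```python
# Sketch of replicating a page's XHR outside the browser. The endpoint,
# query parameters, and referer below are hypothetical placeholders for
# whatever you observe in Firebug / the browser's network tab.
import urllib.parse
import urllib.request

def build_ajax_request(endpoint, params, referer):
    """Recreate the request the page's JavaScript fires."""
    query = urllib.parse.urlencode(params)
    req = urllib.request.Request(endpoint + "?" + query)
    # Some endpoints check these headers before answering.
    req.add_header("X-Requested-With", "XMLHttpRequest")
    req.add_header("Referer", referer)
    return req

req = build_ajax_request(
    "https://example.com/api/content",   # hypothetical endpoint
    {"page": 1, "section": "news"},
    "https://example.com/news",
)
print(req.full_url)  # → https://example.com/api/content?page=1&section=news
```

Once the request object matches what the browser sends, `urllib.request.urlopen(req)` (or the Perl equivalents named above) returns the same payload the page’s JavaScript receives.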

Text Extraction from HTML in Java

Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code: Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); Elements ps = doc.select("p"); Then write it out to a file in one more line: out.write(ps.text()); // it will append all of the p elements together in one long string …
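For comparison with the jsoup two-liner (which is Java), the same “collect the text of every <p>” extraction can be done with nothing but Python’s standard-library parser; the sample markup here is made up:

```python
# Gather the text content of every <p> element using only the stdlib.
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Accumulate the text of each <p> element as it is parsed."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1] += data

collector = ParagraphCollector()
collector.feed("<html><body><p>First.</p><div><p>Second.</p></div></body></html>")
print(" ".join(collector.paragraphs))  # → First. Second.
```

Joining the list mirrors jsoup’s `ps.text()`, which likewise concatenates all paragraph text into one string.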

Scraping content from multiple pages of a website using BeautifulSoup and Selenium

If you want to get the last page number of the above link for proceeding, which is 499, you can use either Selenium or BeautifulSoup as follows. Selenium: from selenium import webdriver driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe') url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061" driver.get(url) element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]") driver.execute_script("return arguments[0].scrollIntoView(true);", element) print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML")) …
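The “last `<li>/<a>` inside the pages list” step doesn’t need a live browser once you have the HTML; here is the same lookup on a simplified, made-up stand-in for the site’s pagination markup, using only the standard library:

```python
# Find the text of the last <a> inside ul.pages — the last page number.
from html.parser import HTMLParser

class LastPageFinder(HTMLParser):
    """Remember the text of the most recent <a> seen inside ul.pages."""
    def __init__(self):
        super().__init__()
        self.in_pages = False
        self.in_link = False
        self.last_page = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and "pages" in (attrs.get("class") or ""):
            self.in_pages = True
        elif tag == "a" and self.in_pages:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_pages = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.last_page = data.strip()

# Simplified stand-in for the real pagination block.
snippet = """
<ul class="pagination table">
  <li><ul class="pages table">
    <li><a>1</a></li><li><a>2</a></li><li><a>499</a></li>
  </ul></li>
</ul>
"""
finder = LastPageFinder()
finder.feed(snippet)
print(finder.last_page)  # → 499
```

With BeautifulSoup the same thing is `soup.select("ul.pages li a")[-1].text`; Selenium is only needed when the pagination is injected by JavaScript.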

Looping over URLs to do the same thing

PhantomJS is asynchronous. By calling page.open() multiple times in a loop, you essentially rush the execution of the callbacks: you overwrite the current request before it is finished with a new request, which is then again overwritten. You need to execute them one after the other, for example like this: page.open(url, function () { waitFor(function() …
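The fix being described, only start the next request when the previous one’s callback has finished, can be seen in miniature with Python’s asyncio; `fetch_page` below is a stand-in for `page.open` and its callback, and the log makes the ordering observable:

```python
# Serializing asynchronous fetches: each one completes before the next starts.
import asyncio

async def fetch_page(url, log):
    """Pretend to open a page, recording start/finish so order is visible."""
    log.append(("start", url))
    await asyncio.sleep(0)      # yield control, like a network round-trip
    log.append(("done", url))

async def fetch_all_sequentially(urls):
    log = []
    for url in urls:
        # Awaiting inside the loop is the key point: the next request does
        # not begin until the previous one has fully finished — unlike
        # firing page.open() for every URL in one pass of the loop.
        await fetch_page(url, log)
    return log

log = asyncio.run(fetch_all_sequentially(["a", "b"]))
print(log)  # → [('start', 'a'), ('done', 'a'), ('start', 'b'), ('done', 'b')]
```

In PhantomJS, which has no `await`, the same serialization is achieved by recursing from inside the callback, as the answer’s `page.open(url, function () { … })` pattern goes on to show.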

Using Python and Mechanize to submit form data and authenticate

I would definitely suggest trying to use the API if possible, but this works for me (not for your example post, which has been deleted, but for any active one): #!/usr/bin/env python import mechanize import cookielib import urllib import logging import sys def main(): br = mechanize.Browser() cj = cookielib.LWPCookieJar() br.set_cookiejar(cj) br.set_handle_equiv(True) br.set_handle_gzip(True) br.set_handle_redirect(True) br.set_handle_referer(True) …
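The excerpt stops mid-setup; the part it is heading toward is filling in and POSTing the login form with the cookie jar attached. Since `mechanize`/`cookielib` as written are Python 2, here is that step sketched with Python 3’s standard library, with hypothetical form field names (check the page’s `<form>` for the real ones):

```python
# Cookie-aware login sketch: a cookie jar attached to an opener, plus a
# form-encoded POST. Field names and URL are hypothetical placeholders.
import http.cookiejar
import urllib.parse
import urllib.request

cj = http.cookiejar.LWPCookieJar()              # same role as cookielib.LWPCookieJar
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; scraper)")]

def build_login_request(login_url, username, password):
    """Encode the login form exactly as a browser would POST it."""
    form = urllib.parse.urlencode({
        "username": username,   # hypothetical field name
        "password": password,   # hypothetical field name
    }).encode("ascii")
    return urllib.request.Request(login_url, data=form, method="POST")

req = build_login_request("https://example.com/login", "alice", "s3cret")
print(req.get_method(), req.data)
```

After `opener.open(req)` succeeds, the session cookies live in `cj`, so subsequent `opener.open(...)` calls are authenticated, which is exactly what mechanize’s Browser does behind the scenes.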

Nokogiri, open-uri, and Unicode Characters

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(…).read and pass the resulting string to Nokogiri. Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. “Genealogía de Jesucristo”. But even with a magic comment in the Ruby file and setting …
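The symptom here, correct UTF-8 bytes turning into mojibake, is purely a decoding mismatch, which a few lines of Python (used only for illustration; the answer itself is about Ruby and Nokogiri) make concrete:

```python
# The same bytes, decoded two ways: the charset you use to interpret the
# bytes, not the bytes themselves, determines whether you see mojibake.
raw = "Genealogía de Jesucristo".encode("utf-8")

right = raw.decode("utf-8")    # encoding honoured — what open(…).read
                               # hands Nokogiri in the fix above
wrong = raw.decode("latin-1")  # same bytes misread: "í" becomes "Ã"
                               # plus a stray soft hyphen

print(right)           # → Genealogía de Jesucristo
print(wrong == right)  # → False
```

This is why the fix is to read the document into a string whose encoding is already correct and hand that to Nokogiri, instead of letting a stream with a mislabelled charset be decoded on the way in.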