How to download any(!) webpage with correct charset in Python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

    fp = urllib2.urlopen(request)
    charset = fp.headers.getparam('charset')

You can use BeautifulSoup to locate a meta element in the HTML:

    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'})

If neither is available, browsers typically fall back to user …
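The header-then-meta lookup described above can be sketched with only the Python 3 standard library. This is a minimal sketch, not the answer's own code: the function names, the 4096-byte scan window, and the `utf-8` default are assumptions made for illustration, and the regular expression is a rough stand-in for a real HTML parser such as BeautifulSoup.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


# Rough pattern covering <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="...; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.I
)


def charset_from_meta(body, scan_bytes=4096):
    """Look for a charset declaration near the start of the raw HTML bytes."""
    match = _META_CHARSET.search(body[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def detect_charset(content_type, body, default="utf-8"):
    """Prefer the HTTP header, then the meta element, then an assumed default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default
```

With a Python 3 response object, `detect_charset(response.headers.get("Content-Type"), data)` would give a codec name to pass to `data.decode(...)`. The default is only a placeholder for whatever fallback suits the pages being fetched.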

Perform screen-scrape of WebBrowser control in thread

You can write:

    private Image TakeSnapShot(WebBrowser browser)
    {
        // Resize the control to the full document so the whole page is drawn.
        browser.Width = browser.Document.Body.ScrollRectangle.Width;
        browser.Height = browser.Document.Body.ScrollRectangle.Height;

        Bitmap bitmap = new Bitmap(browser.Width - System.Windows.Forms.SystemInformation.VerticalScrollBarWidth, browser.Height);
        browser.DrawToBitmap(bitmap, new Rectangle(0, 0, bitmap.Width, bitmap.Height));
        return bitmap;
    }

A full working example:

    var image = await WebUtils.GetPageAsImageAsync("http://www.stackoverflow.com");
    image.Save(fname, System.Drawing.Imaging.ImageFormat.Bmp);

    public class WebUtils
    {
        public static Task<Image> GetPageAsImageAsync(string url)
        {
            var tcs = …

Screen Scraping from a web page with a lot of Javascript [closed]

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not really perfect right now, but …

scrape websites with infinite scrolling

You can use Selenium to scrape infinite-scrolling websites such as Twitter or Facebook.

Step 1: Install Selenium using pip:

    pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import …
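Since the excerpt's code is cut off, here is a minimal sketch of one common way to drive an infinite scroll. It is not necessarily the technique the answer uses: it scrolls with JavaScript via `execute_script` and stops once the page height stops growing. The helper name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are hypothetical; `driver` is assumed to be a Selenium WebDriver, although any object with an `execute_script` method would work.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger the next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the end was reached
        last_height = new_height
    return rounds
```

After the loop finishes, `driver.page_source` holds the fully expanded HTML. The `max_rounds` cap is there because some feeds never stop growing.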

Unable to call Firefox from Selenium in Python on AWS machine

The problem is that Firefox requires a display. I've used pyvirtualdisplay in my example to simulate one. The solution is:

    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1024, 768))
    display.start()

    driver = webdriver.Firefox()
    driver.get("http://www.somewebsite.com/")

    <---some code--->

    # driver.close()  # Close the current window.
    driver.quit()  # Quit the driver and close every associated window.
    …

Scrape a dynamic website

This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls). It's a heavyweight solution, but I've seen people doing this with Greasemonkey scripts: allow Firefox to render everything and …
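As an illustration of the per-site reverse-engineering route, some pages ship their data as a JSON literal assigned to a variable in an inline script, which can be read without running any JavaScript. The sketch below assumes that pattern; the variable name `window.__DATA__` and the helper name are hypothetical, and real sites will differ.

```python
import json
import re


def extract_embedded_json(html, variable="window.__DATA__"):
    """Return the JSON value assigned to `variable` in an inline script, or None."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (such as the trailing semicolon).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value
```

Using the JSON decoder instead of a brace-matching regular expression means nested objects and braces inside strings are handled correctly. It only works when the assigned value is strict JSON, not an arbitrary JavaScript expression.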