screen-scraping - w3toppers.com

How can I scrape an HTML table to CSV?

Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible). Paste it into Excel. Save it as a CSV file. However, this is a manual solution, not an automated one.
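For an automated alternative to the manual steps, here is a minimal sketch that uses only the Python standard library. It is not part of the original answer. It assumes the HTML contains a single, simple table (no nested tables, no rowspan/colspan), and the names `TableParser` and `html_table_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of cell strings (assumes one flat table)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_csv(html):
    """Return the table found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Calling `html_table_to_csv(page_html)` returns a string that can be written straight to a `.csv` file. Cells that contain commas are quoted by the `csv` module.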

How to download any(!) webpage with correct charset in python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

    fp = urllib2.urlopen(request)
    charset = fp.headers.getparam('charset')

You can use BeautifulSoup to locate a meta element in the HTML:

    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'})

If neither is available, browsers typically fall back to user …
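The two lookups described above (HTTP header first, then a meta element) can be sketched in Python 3 with the standard library alone. This is an illustration, not the answer's own code: the function names are hypothetical, the meta lookup is a simple regular expression rather than BeautifulSoup, and the function returns `None` instead of guessing a fallback when neither source declares a charset.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(body, scan_bytes=4096):
    """Look for a charset declaration in the first bytes of the raw HTML."""
    match = META_CHARSET.search(body[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def detect_charset(content_type, body):
    """Prefer the HTTP header; otherwise fall back to the meta element."""
    return charset_from_header(content_type) or charset_from_meta(body)
```

With `urllib.request`, the header value would come from `response.headers.get("Content-Type")` and the body from `response.read()`.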

Scrape web pages in real time with Node.js

Node.io seems to take the cake 🙂

Perform screen-scrape of WebBrowser control in thread

You can write:

    private Image TakeSnapShot(WebBrowser browser)
    {
        browser.Width = browser.Document.Body.ScrollRectangle.Width;
        browser.Height = browser.Document.Body.ScrollRectangle.Height;

        Bitmap bitmap = new Bitmap(browser.Width - System.Windows.Forms.SystemInformation.VerticalScrollBarWidth, browser.Height);
        browser.DrawToBitmap(bitmap, new Rectangle(0, 0, bitmap.Width, bitmap.Height));
        return bitmap;
    }

Full working code:

    var image = await WebUtils.GetPageAsImageAsync("http://www.stackoverflow.com");
    image.Save(fname, System.Drawing.Imaging.ImageFormat.Bmp);

    public class WebUtils
    {
        public static Task<Image> GetPageAsImageAsync(string url)
        {
            var tcs = …

Screen Scraping from a web page with a lot of Javascript [closed]

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not really perfect right now, but …

scrape websites with infinite scrolling

You can use Selenium to scrape infinite-scrolling websites such as Twitter or Facebook.

Step 1: Install Selenium using pip:

    pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import …
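Since the excerpt is cut off before the scrolling logic, here is one common way such a loop is written. It is a sketch and not necessarily what the original answer does. The function name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are hypothetical. The only driver features it relies on are `execute_script` and `page_source`, both of which Selenium WebDriver objects provide.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing, then return the page source.

    `driver` is expected to behave like a Selenium WebDriver.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom so the page loads its next batch of content.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait; assumes new content arrives within `pause` seconds
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the end was reached
        last_height = new_height
    return driver.page_source


# Example usage (requires Selenium and a browser driver to be installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://example.com/feed")   # placeholder URL
#   html = scroll_to_bottom(driver)
#   driver.quit()
```

The `max_rounds` cap is there because a truly endless feed would otherwise never terminate. A fixed sleep is the simplest wait strategy, and an explicit wait on a page element would be more robust.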

unable to call firefox from selenium in python on AWS machine

The problem is that Firefox requires a display. I've used pyvirtualdisplay in my example to simulate a display. The solution is:

    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1024, 768))
    display.start()

    driver = webdriver.Firefox()
    driver.get("http://www.somewebsite.com/")

    # <-- some code -->

    #driver.close()  # Close the current window.
    driver.quit()    # Quit the driver and close every associated window.
    …

CasperJS passing data back to PHP

I think the best way to transfer data from CasperJS to another language such as PHP is to run the CasperJS script as a service. Because CasperJS is built on top of PhantomJS, it can use PhantomJS's embedded web server module, called Mongoose. For information about how the embedded web server works, see here. Here an …

Protection from screen scraping [closed]

You can’t prevent it.

Scrape a dynamic website

This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls). It's a heavyweight solution, but I've seen people doing this with Greasemonkey scripts: allow Firefox to render everything and …