How can I scrape an HTML table to CSV?
Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible). Paste it into Excel and save it as a CSV file. However, this is a manual solution, not an automated one.
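For a scripted alternative to the copy-paste route, pandas can parse HTML tables directly and write them out as CSV. A minimal sketch, assuming pandas plus an HTML parser backend (lxml, html5lib, or BeautifulSoup) is installed; the inline table and the output filename are just placeholders:

```python
import io

import pandas as pd

html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found in the input
tables = pd.read_html(io.StringIO(html))
tables[0].to_csv("table.csv", index=False)
```

In practice you would pass a URL or downloaded page instead of the literal string; `read_html` accepts both.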
When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

```python
fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')
```

You can use BeautifulSoup to locate a meta element in the HTML:

```python
soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'})
```

If neither is available, browsers typically fall back to user …
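The snippets above are Python 2 (urllib2). A Python 3 sketch of the same header-then-meta detection order follows; the helper name, the fallback value, and the exact meta parsing are my own choices, not from the answer:

```python
from bs4 import BeautifulSoup


def detect_charset(headers_charset, data, fallback="windows-1252"):
    """Pick a charset: HTTP header first, then a <meta> tag, then a fallback."""
    # 1. Charset from the HTTP response headers, if the server sent one
    #    (with urllib.request this is resp.headers.get_content_charset())
    if headers_charset:
        return headers_charset
    # 2. Charset from a <meta http-equiv="Content-Type" ...> element
    soup = BeautifulSoup(data, "html.parser")
    meta = soup.find("meta", {"http-equiv": lambda v: v and v.lower() == "content-type"})
    if meta and "charset=" in meta.get("content", "").lower():
        return meta["content"].lower().split("charset=")[-1].strip()
    # 3. Browsers typically fall back to a locale default
    return fallback


html = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
detect_charset(None, html)  # -> "utf-8"
```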
Node.io seems to take the cake 🙂
You can write:

```csharp
private Image TakeSnapShot(WebBrowser browser)
{
    browser.Width = browser.Document.Body.ScrollRectangle.Width;
    browser.Height = browser.Document.Body.ScrollRectangle.Height;
    Bitmap bitmap = new Bitmap(
        browser.Width - System.Windows.Forms.SystemInformation.VerticalScrollBarWidth,
        browser.Height);
    browser.DrawToBitmap(bitmap, new Rectangle(0, 0, bitmap.Width, bitmap.Height));
    return bitmap;
}
```

A full working example:

```csharp
var image = await WebUtils.GetPageAsImageAsync("http://www.stackoverflow.com");
image.Save(fname, System.Drawing.Imaging.ImageFormat.Bmp);

public class WebUtils
{
    public static Task<Image> GetPageAsImageAsync(string url)
    {
        var tcs = …
```
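A comparable approach in Python uses Selenium's screenshot API instead of the WinForms WebBrowser control; this is an alternative technique, not a translation of the C# code. The function is factored so the resize-to-document-height logic is visible; the window width is an arbitrary assumption:

```python
def capture_full_page(driver, path, width=1280):
    """Resize the window to the full document height, then save a screenshot."""
    height = driver.execute_script("return document.body.scrollHeight")
    driver.set_window_size(width, height)
    driver.save_screenshot(path)
    return height


# Usage (assumes Selenium and a driver such as geckodriver are installed):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("http://www.stackoverflow.com")
# capture_full_page(driver, "page.png")
# driver.quit()
```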
You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not really perfect right now, but …
You can use Selenium to scrape an infinite-scrolling website like Twitter or Facebook.

Step 1: Install Selenium using pip:

```shell
pip install selenium
```

Step 2: Use the code below to automate the infinite scroll and extract the source code:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import …
```
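Since the snippet above is cut off, here is a sketch of the usual scroll loop for this technique: keep scrolling to the bottom until the page height stops growing, then grab the page source. The pause length, round limit, and driver setup are assumptions, not from the original answer:

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the document height stops growing, then return the page source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded: we reached the real bottom
        last_height = new_height
    return driver.page_source


# Usage (assumes a driver such as chromedriver is on PATH):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://twitter.com/some_user")
# html = scroll_to_bottom(driver)
# driver.quit()
```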
The problem is that Firefox requires a display. I've used pyvirtualdisplay in my example to simulate a display. The solution is:

```python
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))
display.start()

driver = webdriver.Firefox()
driver.get("http://www.somewebsite.com/")

# <--- some code --->

# driver.close()  # Close the current window.
driver.quit()     # Quit the driver and close every associated window.
```

…
I think the best way to transfer data from CasperJS to another language such as PHP is to run the CasperJS script as a service. Because CasperJS is built on top of PhantomJS, it can use PhantomJS's embedded web-server module, called Mongoose. For information about how the embedded web server works, see here. Here is an …
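Once the CasperJS script is listening as a service, any language can consume it over HTTP. A sketch of the client side in Python; the port, the `/scrape` route, and the JSON response shape are all hypothetical and depend entirely on how your CasperJS script wires up the embedded web server:

```python
import json
import urllib.parse
import urllib.request


def service_url(page_url, port=8585, endpoint="/scrape"):
    """Build the request URL for the (hypothetical) CasperJS service route."""
    query = urllib.parse.urlencode({"url": page_url})
    return f"http://127.0.0.1:{port}{endpoint}?{query}"


def call_casper_service(page_url):
    """Ask the local CasperJS service to scrape a page; assumes it returns JSON."""
    with urllib.request.urlopen(service_url(page_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```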
You can’t prevent it.
This is a difficult problem, because you either have to reverse-engineer the JavaScript on a per-site basis or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls). It's a heavyweight solution, but I've seen people doing this with Greasemonkey scripts: allow Firefox to render everything and …