Can I scrape the raw data from highcharts.js?

The data is embedded in a script tag. You can get the script tag using bs4 and a regex. You could also extract the data with a regex alone, but I like using js2xml to parse JS functions into an XML tree:

```python
from bs4 import BeautifulSoup
import requests
import re
import js2xml

soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
script
```

… Read more
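To give a fuller picture, here is a minimal sketch of how that pattern usually plays out end to end. The regex marker `Highcharts` and the `//array/number/@value` xpath are my assumptions about this page, not part of the original answer:

```python
import re
import requests
import js2xml
from bs4 import BeautifulSoup

html = requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content
soup = BeautifulSoup(html, "html.parser")

# Locate the inline <script> that builds the chart (assumed to mention Highcharts)
script = soup.find("script", string=re.compile("Highcharts"))

# js2xml turns the JS source into an lxml element tree we can query with xpath
parsed = js2xml.parse(script.string)

# Numeric literals inside JS arrays, e.g. the chart's data points (assumption)
values = parsed.xpath("//array/number/@value")
print(values)
```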

BeautifulSoup returns None even though the element exists

Try this code:

```python
from selenium import webdriver
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")
officials = soup.findAll("table", {"id": "officials"})
for entry in officials:
    print(str(entry))
driver.quit()
```

It will print:

```html
<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th
```

… Read more
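If you would rather not guess at a fixed delay, here is a sketch of the same flow with an explicit wait instead of `time.sleep(5)` (my variant, assuming Chrome and the standard Selenium wait helpers):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.pro-football-reference.com/boxscores/201309050den.htm")

# Block (up to 10 s) until the officials table is actually in the DOM,
# rather than sleeping a fixed 5 seconds and hoping the page is ready
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "officials"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("table", {"id": "officials"}))
driver.quit()
```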

Fetching multiple urls with aiohttp in python

Working example:

```python
import asyncio
import aiohttp
import ssl

url_list = ['https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530396000&before=1530436000',
            'https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530436000&before=1530476000']

async def fetch(session, url):
    async with session.get(url, ssl=ssl.SSLContext()) as response:
        return await response.json()

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls],
                                       return_exceptions=True)
        return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    urls =
```

… Read more
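The excerpt cuts off before the driver code runs; here is a plausible self-contained version of the same pattern (my completion, not the original answer's truncated tail, using `asyncio.run` from Python 3.7+ rather than an explicit event loop):

```python
import asyncio
import aiohttp

URLS = [
    "https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530396000&before=1530436000",
    "https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530436000&before=1530476000",
]

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # gather() runs all requests concurrently; with return_exceptions=True
        # a failed request comes back as an exception object, not a crash
        return await asyncio.gather(*(fetch(session, u) for u in urls),
                                    return_exceptions=True)

results = asyncio.run(fetch_all(URLS))
print(len(results), "responses fetched")
```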

Python 3: using requests does not get the full content of a web page

The page is rendered with JavaScript, which makes further requests to fetch additional data. You can fetch the complete page with Selenium:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
```

For other solutions, see my answer to Scraping Google Finance (BeautifulSoup).
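One quick way to confirm that diagnosis yourself (an illustrative check, not from the original answer; the `article` tag is a guess at how products might be marked up on this page):

```python
import requests
from bs4 import BeautifulSoup

url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"

# What requests sees: only the initial HTML, before any JavaScript runs
raw = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(raw, "html.parser")

# Few or no product nodes here means the content is filled in client-side,
# which is exactly when you need a real browser such as Selenium
print(len(raw), "bytes,", len(soup.find_all("article")), "article tags")
```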

Scraping Data from a website which uses Power BI – retrieving data from Power BI on a website

Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

```python
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
```

Then sort them using their location:

```python
x = [child.location['x'] for child in children]
y = [child.location['y'] for
```

… Read more
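A hypothetical continuation of that sorting idea: bucket the elements into visual rows by their `y` coordinate, then order each row left-to-right by `x` (the function name and pixel tolerance are my inventions, not from the original answer):

```python
def grid_from_elements(children, row_tolerance=5):
    """Rebuild a table-like grid from absolutely positioned elements."""
    cells = [(c.location['y'], c.location['x'], c.text) for c in children if c.text]
    cells.sort()  # top-to-bottom first, then left-to-right within a row

    rows, current, last_y = [], [], None
    for y, x, text in cells:
        if last_y is not None and abs(y - last_y) > row_tolerance:
            rows.append(current)  # y jumped beyond tolerance: new visual row
            current = []
        current.append(text)
        last_y = y
    if current:
        rows.append(current)
    return rows

# e.g. for row in grid_from_elements(children): print(row)
```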

scrapy- how to stop Redirect (302)

Yes, you can do this simply by adding meta values like

```python
meta = {'dont_redirect': True}
```

You can also stop redirects only for a particular response code:

```python
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
```

This will stop redirecting only 302 response codes; you can add as many HTTP status codes as you want to avoid redirecting. Example:

```python
yield Request('some url', meta =
```

… Read more
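For context, a sketch of how those meta keys might sit inside a full request (my completion; the spider class and URL are illustrative):

```python
import scrapy

class NoRedirectSpider(scrapy.Spider):
    name = "no_redirect"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/some-url",
            meta={"dont_redirect": True, "handle_httpstatus_list": [302]},
            callback=self.parse,
        )

    def parse(self, response):
        # With the meta flags above, a 302 response is handed to this
        # callback instead of being followed by RedirectMiddleware
        self.logger.info("Got status %s", response.status)
```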