Can I scrape the raw data from highcharts.js?

The data is embedded in a script tag. You can get the script tag using bs4 and a regex. You could also extract the data with a regex alone, but I like using js2xml to parse JS functions into an XML tree:

```python
from bs4 import BeautifulSoup
import requests
import re
import js2xml

soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
script
```

… Read more
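To give a fuller picture, here is a minimal sketch of how that pattern usually plays out end to end. The regex marker `Highcharts` and the `//array/number/@value` xpath are my assumptions about this page, not part of the original answer:

```python
import re
import requests
import js2xml
from bs4 import BeautifulSoup

html = requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content
soup = BeautifulSoup(html, "html.parser")

# Locate the inline <script> that builds the chart (assumed to mention Highcharts)
script = soup.find("script", string=re.compile("Highcharts"))

# js2xml turns the JS source into an lxml element tree we can query with xpath
parsed = js2xml.parse(script.string)

# Numeric literals inside JS arrays, e.g. the chart's data points (assumption)
values = parsed.xpath("//array/number/@value")
print(values)
```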

BeautifulSoup returns None even though the element exists

Try this code:

```python
from selenium import webdriver
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")
officials = soup.findAll("table", {"id": "officials"})
for entry in officials:
    print(str(entry))
driver.quit()
```

It will print:

```html
<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th
```

… Read more
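If you would rather not guess at a fixed delay, here is a sketch of the same flow with an explicit wait instead of `time.sleep(5)` (my variant, assuming Chrome and the standard Selenium wait helpers):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.pro-football-reference.com/boxscores/201309050den.htm")

# Block (up to 10 s) until the officials table is actually in the DOM,
# rather than sleeping a fixed 5 seconds and hoping the page is ready
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "officials"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("table", {"id": "officials"}))
driver.quit()
```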

Fetching multiple urls with aiohttp in python

Working example:

```python
import asyncio
import aiohttp
import ssl

url_list = ['https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530396000&before=1530436000',
            'https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530436000&before=1530476000']

async def fetch(session, url):
    async with session.get(url, ssl=ssl.SSLContext()) as response:
        return await response.json()

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls],
                                       return_exceptions=True)
        return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    urls =
```

… Read more
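The excerpt cuts off before the driver code runs; here is a plausible self-contained version of the same pattern (my completion, not the original answer's truncated tail, using `asyncio.run` from Python 3.7+ rather than an explicit event loop):

```python
import asyncio
import aiohttp

URLS = [
    "https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530396000&before=1530436000",
    "https://api.pushshift.io/reddit/search/comment/?q=Nestle&size=30&after=1530436000&before=1530476000",
]

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # gather() runs all requests concurrently; with return_exceptions=True
        # a failed request comes back as an exception object, not a crash
        return await asyncio.gather(*(fetch(session, u) for u in urls),
                                    return_exceptions=True)

results = asyncio.run(fetch_all(URLS))
print(len(results), "responses fetched")
```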

Python 3: using requests does not get the full content of a web page

The page is rendered with JavaScript, which makes further requests to fetch additional data. You can fetch the complete page with Selenium:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
```

For other solutions, see my answer to Scraping Google Finance (BeautifulSoup).
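One quick way to confirm that diagnosis yourself (an illustrative check, not from the original answer; the `article` tag is a guess at how products might be marked up on this page):

```python
import requests
from bs4 import BeautifulSoup

url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"

# What requests sees: only the initial HTML, before any JavaScript runs
raw = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(raw, "html.parser")

# Few or no product nodes here means the content is filled in client-side,
# which is exactly when you need a real browser such as Selenium
print(len(raw), "bytes,", len(soup.find_all("article")), "article tags")
```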

Scraping Data from a website which uses Power BI – retrieving data from Power BI on a website

Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

```python
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
```

Then sort them using their location:

```python
x = [child.location['x'] for child in children]
y = [child.location['y'] for
```

… Read more
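A hypothetical continuation of that sorting idea: bucket the elements into visual rows by their `y` coordinate, then order each row left-to-right by `x` (the function name and pixel tolerance are my inventions, not from the original answer):

```python
def grid_from_elements(children, row_tolerance=5):
    """Rebuild a table-like grid from absolutely positioned elements."""
    cells = [(c.location['y'], c.location['x'], c.text) for c in children if c.text]
    cells.sort()  # top-to-bottom first, then left-to-right within a row

    rows, current, last_y = [], [], None
    for y, x, text in cells:
        if last_y is not None and abs(y - last_y) > row_tolerance:
            rows.append(current)  # y jumped beyond tolerance: new visual row
            current = []
        current.append(text)
        last_y = y
    if current:
        rows.append(current)
    return rows

# e.g. for row in grid_from_elements(children): print(row)
```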

scrapy- how to stop Redirect (302)

Yes, you can do this simply by adding meta values like

```python
meta = {'dont_redirect': True}
```

You can also stop redirects only for a particular response code:

```python
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
```

This will stop redirecting only 302 response codes; you can add as many HTTP status codes as you want to avoid redirecting. Example:

```python
yield Request('some url', meta =
```

… Read more
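For context, a sketch of how those meta keys might sit inside a full request (my completion; the spider class and URL are illustrative):

```python
import scrapy

class NoRedirectSpider(scrapy.Spider):
    name = "no_redirect"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/some-url",
            meta={"dont_redirect": True, "handle_httpstatus_list": [302]},
            callback=self.parse,
        )

    def parse(self, response):
        # With the meta flags above, a 302 response is handed to this
        # callback instead of being followed by RedirectMiddleware
        self.logger.info("Got status %s", response.status)
```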