XPath: Get following sibling

You need the second tr that contains a td equal to 'Color Digest'. From that row, take either the following sibling of the first td or simply the second td. Try one of the following:

    //tr[td='Color Digest'][2]/td/following-sibling::td[1]

or

    //tr[td='Color Digest'][2]/td[2]

http://www.xpathtester.com/saved/76bb0bca-1896-43b7-8312-54f924a98a89
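To see the selection logic outside an XPath tester, here is a minimal sketch using Python's standard-library ElementTree. ElementTree supports only a subset of XPath (it has no following-sibling axis), so the positional steps are done with ordinary list indexing instead. The sample table, the cell values, and the function name second_color_digest_value are made up for illustration, and the sketch assumes all matching rows live in a single table.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the real page structure is not shown in the answer.
SAMPLE = """
<table>
  <tr><td>Color Digest</td><td>first-value</td></tr>
  <tr><td>Other</td><td>ignored</td></tr>
  <tr><td>Color Digest</td><td>second-value</td></tr>
</table>
"""


def second_color_digest_value(markup, label="Color Digest"):
    """Return the text of the second td in the second row whose td equals label."""
    root = ET.fromstring(markup.strip())
    # Rows that have a td child whose full text equals the label.
    rows = root.findall(f".//tr[td='{label}']")
    if len(rows) < 2:
        return None
    # Second matching row, then its second cell (same target as .../td[2]).
    cells = rows[1].findall("td")
    return cells[1].text if len(cells) > 1 else None


if __name__ == "__main__":
    print(second_color_digest_value(SAMPLE))
```

With a full XPath 1.0 engine, either expression above can be used directly and no manual indexing is needed.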

BeautifulSoup: extract text from anchor tag

This will help:

    from bs4 import BeautifulSoup

    data = '''<div class="image">
    <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
    </div>
    <div class="image">
    <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
    </div>'''

    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, 'html.parser')
    for div in soup.find_all('div', attrs={'class': 'image'}):
        print(div.find('a')['href'])      # link target
        print(div.find('a').contents[0])  # text node that precedes the <img>
        print(div.find('img')['src'])     # image URL

If you are looking into Amazon products then you should be using the official API. There is at least one Python package …
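If bs4 is not available, roughly the same extraction can be sketched with the standard-library html.parser module. This is a different technique from the BeautifulSoup answer above, not a drop-in replacement. The class name ImageLinkParser and the record layout are assumptions made for this example, and the sketch does not handle nested div elements.

```python
from html.parser import HTMLParser

DATA = '''<div class="image">
<a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
</div>'''


class ImageLinkParser(HTMLParser):
    """Collect href, anchor text, and img src for links inside <div class="image">."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_image_div = False
        self._link = None  # record for the <a> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self._in_image_div = "image" in (attrs.get("class") or "").split()
        elif tag == "a" and self._in_image_div:
            self._link = {"href": attrs.get("href"), "text": "", "src": None}
        elif tag == "img" and self._link is not None:
            self._link["src"] = attrs.get("src")

    def handle_data(self, data):
        # Accumulate text only while inside a tracked <a>.
        if self._link is not None:
            self._link["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._link is not None:
            self._link["text"] = self._link["text"].strip()
            self.records.append(self._link)
            self._link = None
        elif tag == "div":
            self._in_image_div = False


def extract_image_links(markup):
    parser = ImageLinkParser()
    parser.feed(markup)
    parser.close()
    return parser.records


if __name__ == "__main__":
    for record in extract_image_links(DATA):
        print(record["href"], record["text"], record["src"])
```

For anything beyond simple, well-formed markup, BeautifulSoup remains the more forgiving choice.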

How to scrape a website that requires login first with Python

This works for me:

    # Method 1
    import http.cookiejar as cookielib  # Python 3 name of the old cookielib module
    import mechanize
    from bs4 import BeautifulSoup  # replaces the legacy "from BeautifulSoup import BeautifulSoup"
    import html2text

    # Browser
    br = mechanize.Browser()

    # Cookie jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Chrome')]

    # The site we will navigate into, handling its session
    br.open('https://github.com/login')

    # …
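The core idea, keeping one cookie jar across the login request and the later page requests, can also be sketched with the standard library alone. This is not the mechanize method from the answer. The function names, the example URLs, and the form field names are all hypothetical, and many real login forms also require hidden fields (such as a CSRF token) that must be read from the login page first.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", "Chrome")]
    return opener, jar


def build_login_request(login_url, fields):
    """Encode the form fields; passing data makes urllib send a POST."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body)


def login_and_fetch(login_url, fields, protected_url):
    """Log in, then fetch a page that needs the session cookie (uses the network)."""
    opener, _jar = build_session()
    with opener.open(build_login_request(login_url, fields)) as response:
        response.read()  # the session cookie is captured by the jar
    with opener.open(protected_url) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical field names; inspect the real form to find the actual ones.
    request = build_login_request(
        "https://example.com/login", {"login": "user", "password": "secret"}
    )
    print(request.get_method(), request.data)
```

mechanize adds form discovery and browser-like behaviour on top of this, which is why it is convenient for sites with more involved login forms.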

Scrape websites with infinite scrolling

You can use Selenium to scrape infinite-scrolling websites such as Twitter or Facebook.

Step 1: Install Selenium using pip:

    pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import …
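Since the code in the excerpt is cut off, here is a minimal sketch of one common scrolling pattern, which may differ from what the full answer does: scroll to the bottom, wait, and stop once the page height no longer grows. The function name scroll_to_bottom and its parameters are assumptions for this example. The only driver method it relies on is execute_script, so it works with any Selenium WebDriver instance.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50, sleep=time.sleep):
    """Scroll until document height stops growing; return the number of rounds used.

    driver needs an execute_script(script) method (as Selenium WebDriver has).
    pause is the wait in seconds for new content to load after each scroll.
    max_rounds caps the loop so a truly endless feed cannot run forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page to trigger loading of more items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so we are at the end
        last_height = new_height
    return max_rounds
```

In use, you would create a driver (for example webdriver.Chrome()), call driver.get(url), run scroll_to_bottom(driver), and then read driver.page_source. A fixed pause is the simplest option; slow pages may need a longer value.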