Scrape the absolute URL instead of a relative path in Python

urllib.parse.urljoin() might help. It joins a base URL and a link, but it is smart about it: a relative path is resolved against the base, while a link that is already absolute is returned as-is. Note that this is Python 3 code.

>>> import urllib.parse
>>> base="https://www.example-page-xl.com"

>>> urllib.parse.urljoin(base, "/helloworld/index.php")
'https://www.example-page-xl.com/helloworld/index.php'

>>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
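
In a scraper, the same call can be applied to every href pulled from a page, with the page's own URL as the base. The following is only a minimal sketch: it assumes the standard-library html.parser rather than BeautifulSoup so that it runs without extra installs, and the names LinkCollector, absolute_links, sample_html and page_url are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects each <a href> as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative hrefs against the base and
                # leaves hrefs that are already absolute untouched.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, page_url):
    """Return all link targets in html, resolved against page_url."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup standing in for a downloaded page.
sample_html = """
<a href="/helloworld/index.php">root-relative</a>
<a href="about.html">page-relative</a>
<a href="https://www.example-page-xl.com/helloworld/index.php">absolute</a>
"""
page_url = "https://www.example-page-xl.com/blog/post.html"

for link in absolute_links(sample_html, page_url):
    print(link)
```

Passing the URL of the page that was actually fetched as the base, rather than just the site root, matters for page-relative links such as about.html, which resolve relative to the page's directory.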
