Scrape the absolute URL instead of a relative path in Python

urllib.parse.urljoin() should help. It joins a base URL with a link, and it is smart about it: a relative path is resolved against the base, while a link that is already absolute is returned unchanged. Note that this is Python 3 code.

>>> import urllib.parse
>>> base="https://www.example-page-xl.com"

>>> urllib.parse.urljoin(base, "/helloworld/index.php")
'https://www.example-page-xl.com/helloworld/index.php'

>>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
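
If you are scraping a whole page, the same call can be applied to every href you find. Here is a rough sketch using only the standard library; LinkCollector, absolute_links, and the sample HTML are made-up names for illustration, and the base URL is assumed to be the address of the page you fetched. If you already use a parser such as BeautifulSoup, the urljoin() part stays the same.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <a href> values resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative hrefs are resolved against the base;
                # absolute hrefs come back unchanged.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return every link in html as an absolute URL."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page, assumed to have been fetched from `base`.
sample_html = (
    '<a href="/helloworld/index.php">root-relative</a> '
    '<a href="https://www.example-page-xl.com/about">absolute</a> '
    '<a href="contact.html">page-relative</a>'
)
base = "https://www.example-page-xl.com/docs/page.html"
links = absolute_links(sample_html, base)
print(links)
```

Pass the full URL of the page as the base, not just the site root. A page-relative link such as contact.html is resolved against the directory of the page it appeared on.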
