I need a Powerful Web Scraper library [closed]

Scraping itself is fairly straightforward: you just parse the content you download and pull out all the associated links to follow.

The most important piece, though, is the part that processes the HTML. Because most browsers will happily render HTML that is far from clean (or standards-compliant), you need an HTML parser that can make sense of markup that is not always well-formed.

I recommend you use the HTML Agility Pack for this purpose. It handles non-well-formed HTML very well, and it provides an easy interface for running XPath queries against the resulting document to select nodes.
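Here is a minimal sketch of that workflow with the HTML Agility Pack (the HtmlAgilityPack NuGet package); the sloppy HTML snippet is just an invented example to show that the parser copes with missing quotes and unclosed tags:

    // Parse deliberately messy HTML and extract every link via XPath.
    using System;
    using HtmlAgilityPack;

    class LinkExtractor
    {
        static void Main()
        {
            // Sloppy markup: unquoted attribute, unclosed <a> and <p> tags.
            string html = "<html><body><p>Hello<a href=/page1>First" +
                          "<a href='http://example.com/page2'>Second</body>";

            var doc = new HtmlDocument();
            doc.LoadHtml(html);   // the parser tolerates the malformed structure

            // XPath query for every <a> element that carries an href attribute.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }

Note that SelectNodes returns null (not an empty collection) when nothing matches, so the null check matters.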

Beyond that, you just need to pick a data store to hold your processed data (any database technology will do) and a way to download content from the web, for which .NET provides two high-level mechanisms: the WebClient class and the HttpWebRequest/HttpWebResponse pair.
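For illustration, here is a sketch of both download options; the URL, user-agent string, and timeout are placeholder values you would replace with your own:

    using System;
    using System.IO;
    using System.Net;

    class Downloader
    {
        static void Main()
        {
            string url = "http://example.com/";   // placeholder target

            // Option 1: WebClient -- the simplest way to grab a page as a string.
            using (var client = new WebClient())
            {
                string page = client.DownloadString(url);
                Console.WriteLine("WebClient fetched {0} characters", page.Length);
            }

            // Option 2: HttpWebRequest/HttpWebResponse -- more control over
            // headers, timeouts, redirects, and so on.
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "MyScraper/1.0";   // many sites expect a User-Agent
            request.Timeout = 10000;               // milliseconds

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string page = reader.ReadToEnd();
                Console.WriteLine("Status {0}, {1} characters",
                                  response.StatusCode, page.Length);
            }
        }
    }

WebClient is fine for simple "give me the page" calls; drop down to HttpWebRequest when you need to tweak headers or handle responses more carefully.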
