Selecting attribute values with Html Agility Pack

You can grab the attribute value directly if you use an HtmlNodeNavigator instead.

    // Load the document from an HTML string
    HtmlDocument hdoc = new HtmlDocument();
    hdoc.LoadHtml(htmlContent);

    // Create a navigator for the current document
    HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();

    // Get the value at the given XPath
    string xpath = "//div[@id='topslot']/a/img/@src";
    string val = navigator.SelectSingleNode(xpath).Value;
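
For comparison, here is a minimal sketch of the node-based route: select the element itself and then read its attribute. The markup in `htmlContent` is a hypothetical sample made up for illustration, and this is only one possible approach, not necessarily what the answer's author had in mind.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup, invented for this example.
        string htmlContent =
            "<div id='topslot'><a href='#'><img src='/images/banner.png'></a></div>";

        var hdoc = new HtmlDocument();
        hdoc.LoadHtml(htmlContent);

        // Select the <img> element rather than the attribute.
        HtmlNode img = hdoc.DocumentNode.SelectSingleNode("//div[@id='topslot']/a/img");

        // SelectSingleNode returns null when nothing matches, so guard before reading.
        string src = img == null
            ? string.Empty
            : img.GetAttributeValue("src", string.Empty);

        Console.WriteLine(src);
    }
}
```

The null check matters here because a missing element would otherwise cause a NullReferenceException; the navigator version has the same risk when the XPath matches nothing.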

HTML agility pack – removing unwanted tags without removing content?

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm. It removes all tags except strong, em, u and raw text nodes.

    internal static string RemoveUnwantedTags(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(data);

        var acceptableTags = new String[] { "strong", "em", "u" };
        var nodes = new …

Parsing HTML page with HtmlAgilityPack

There are a number of ways to select elements using the Agility Pack. Let's assume we have defined our HtmlDocument as follows:

    string html = @"<TD class=texte width=""50%"">
    <DIV align=right>Name :<B> </B></DIV></TD>
    <TD width=""50%"">
    <INPUT class=box value=John maxLength=16 size=16 name=user_name>
    </TD>
    <TR vAlign=center>";

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

1. Simple LINQ

We could use …

Grab all text from html with Html Agility Pack

XPath is your friend 🙂

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"<html><body><p>foo <a href=""http://www.example.com"">bar</a> baz</p></body></html>");

    // Every text node in the document, in document order
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
    {
        Console.WriteLine("text=" + node.InnerText);
    }
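
On real pages, `//text()` also matches whitespace-only nodes, and `SelectNodes` returns null rather than an empty collection when nothing matches. A hedged variation that handles both cases might look like the sketch below; the `normalize-space()` predicate and the entity decoding are additions for illustration, not part of the original answer.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");

        // Keep only text nodes that contain something other than whitespace.
        var textNodes = doc.DocumentNode.SelectNodes("//text()[normalize-space()]");

        // SelectNodes yields null when there are no matches.
        if (textNodes == null) return;

        foreach (HtmlNode node in textNodes)
        {
            // DeEntitize turns entities such as &amp; back into plain characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            Console.WriteLine("text=" + text);
        }
    }
}
```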

HTML Agility Pack strip tags NOT IN whitelist

Heh, apparently I ALMOST found an answer in a blog post someone made...

    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    namespace Wayloop.Blog.Core.Markup
    {
        public static class HtmlSanitizer
        {
            private static readonly IDictionary<string, string[]> Whitelist;

            static HtmlSanitizer()
            {
                Whitelist = new Dictionary<string, string[]> {
                    { "a", new[] { "href" } },
                    { "strong", null },
                    { "em", …