parsing XML with ampersand

Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you’re absolutely sure the values do not contain other escaped items.

For example, "wow&".Replace("&", "&") results in wow& which is clearly undesirable.

Regex.Replace can give you more control to avoid this scenario, and can be written to only match “&” symbols that are not part of other characters, such as <, something like:

string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&");

The above works, but admittedly it doesn’t cover the variety of other characters that start with an ampersand, such as   and the list can grow.

A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&" the decode process would return "&wow&" then re-encoding it would return "&wow&", which is desirable. To pull this off you could use this:

string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
    HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
    "\"");
var doc = XElement.Parse(result);

Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.


EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.

string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
            m.Groups["start"].Value +
            HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
            m.Groups["end"].Value);
var doc = XElement.Parse(result);

Leave a Comment