Reading XML with an “&” into C# XMLDocument Object

The problem is that the XML is not well-formed. Properly generated XML would list the data like this:

Prepaid &amp; Charge

I’ve fixed the same problem before, and I did it with this regex:

Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

Combine that with a string constant defined like this:

const string goodAmpersand = "&amp;";

Now you can say badAmpersand.Replace(<your input>, goodAmpersand);
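Put together, the pieces above might look like the following minimal sketch. The class name AmpersandFixer and the method LoadLenient are hypothetical names chosen for illustration, and the sketch assumes the whole document is already in memory as a string.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class AmpersandFixer
{
    // Matches "&" unless it starts a named entity (such as &amp;) or a
    // decimal character reference (such as &#38;).
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

    private const string GoodAmpersand = "&amp;";

    // Hypothetical helper: escape stray ampersands, then parse the result.
    public static XmlDocument LoadLenient(string rawXml)
    {
        string cleaned = BadAmpersand.Replace(rawXml, GoodAmpersand);
        var doc = new XmlDocument();
        doc.LoadXml(cleaned); // still throws XmlException for other errors
        return doc;
    }
}
```

For example, AmpersandFixer.LoadLenient("<item>Prepaid & Charge &amp; more</item>") should parse, and the InnerText of its root element reads "Prepaid & Charge & more": the bare ampersand gets escaped while the one that was already coded correctly is left alone.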

Note that a simple String.Replace("&", "&amp;") isn’t good enough, since you can’t know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document. A blanket replace would turn an already-correct &amp; into &amp;amp;.

The catches here are that you have to do this to your XML document before loading it into your parser, which likely means an extra pass through the document. Also, it does not account for ampersands inside a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, the expression also needs to handle hex-coded (&#x…;) entities.
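One way to address the hex-coded and CDATA catches, as an untested-in-production sketch rather than the expression I actually used: add a hex branch to the lookahead, and let the regex consume whole CDATA sections so their contents are passed through untouched. The class and method names are hypothetical, and the {1,6} limit on hex digits is my own assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandFixerV2
{
    // Alternative 1 swallows an entire CDATA section so nothing inside it
    // is rewritten. Alternative 2 is the original expression plus an
    // assumed branch for hex character references such as &#x26;.
    private static readonly Regex CDataOrBadAmpersand = new Regex(
        @"<!\[CDATA\[.*?\]\]>|&(?![a-zA-Z]{2,6};|#[0-9]{2,4};|#x[0-9a-fA-F]{1,6};)",
        RegexOptions.Singleline);

    public static string EscapeStrayAmpersands(string rawXml)
    {
        // A match that is exactly "&" is a stray ampersand; anything else
        // is a CDATA section, which is returned unchanged.
        return CDataOrBadAmpersand.Replace(
            rawXml,
            m => m.Value == "&" ? "&amp;" : m.Value);
    }
}
```

With this version, the input "<a>R&D &#x26; <![CDATA[x & y]]></a>" becomes "<a>R&amp;D &#x26; <![CDATA[x & y]]></a>". It still does nothing about other problem characters such as <.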

Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there’s no simple list of illegal characters. Instead, large (non-contiguous) swaths of Unicode are defined as legal, and anything outside that is illegal.
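Rather than encoding those rules by hand, you can ask the framework. The sketch below leans on System.Xml.XmlConvert (available since .NET Framework 4.0); the wrapper class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class XmlCharChecks
{
    // True when every character in the string falls inside the ranges
    // XML 1.0 allows. This checks characters only, not markup.
    public static bool IsLegalXmlText(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // A character may be fine inside a name but not at its start.
    public static bool CanStartName(char c) => XmlConvert.IsStartNCNameChar(c);
    public static bool CanContinueName(char c) => XmlConvert.IsNCNameChar(c);
}
```

A digit such as '1' passes CanContinueName but fails CanStartName, and a control character such as U+0001 makes IsLegalXmlText return false. Note that & and < are legal characters as far as these checks go; they only cause trouble because of what they mean as markup, so a bare ampersand will not be flagged here.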

When it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I’ve found people are often smart enough to make sure the tags work properly and escape <, even if they don’t know that & isn’t allowed, hence your problem today. However, the best thing would be to get this fixed at the source.

Oh, and a note about the CDATA suggestion: I use that to make sure xml I’m creating is well-formed, but when dealing with existing xml from outside, I find the regex method easier.
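For the creating side, here is roughly what I mean, as a small sketch using XmlDocument.CreateCDataSection. The element name "item" and the helper name BuildItem are made up for illustration.

```csharp
using System;
using System.Xml;

public static class CDataExample
{
    // Wraps arbitrary text in a CDATA section so that characters such as
    // & and < need no escaping in the serialized output.
    public static string BuildItem(string text)
    {
        var doc = new XmlDocument();
        XmlElement item = doc.CreateElement("item"); // hypothetical element name
        item.AppendChild(doc.CreateCDataSection(text));
        doc.AppendChild(item);
        return doc.OuterXml;
    }
}
```

CDataExample.BuildItem("Prepaid & Charge") returns "<item><![CDATA[Prepaid & Charge]]></item>". Text that itself contains "]]>" cannot sit inside a single CDATA section and would need separate handling.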
