Unicode Regex; Invalid XML characters

I know this isn’t exactly an answer to your question, but it’s helpful to have it here:

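Regular expression to match valid XML characters in the Basic Multilingual Plane (XML 1.0 also allows U+10000 through U+10FFFF, which show up in .NET strings as surrogate pairs):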

[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]

So to remove invalid chars from XML, you’d do something like:

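// Matches unpaired surrogates, C0 control characters other than tab/LF/CR,
// DEL and the C1 controls (U+007F-U+009F), U+FEFF, and U+FFFE/U+FFFF.
// Well-formed surrogate pairs are left alone. Note that U+007F-U+009F and U+FEFF
// are technically legal in XML 1.0; this pattern strips them anyway.
private static readonly Regex _invalidXMLChars = new Regex(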
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);

/// <summary>
/// Removes unpaired surrogates, control characters, and other code points that can't be safely encoded into XML.
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
    if (string.IsNullOrEmpty(text)) return "";
    return _invalidXMLChars.Replace(text, "");
}

I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
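
As a cross-check on the "valid characters" class at the top, here is a minimal sketch of the same idea without a regex, walking the string by code point. The names XmlCharCheck, IsValidXmlCodePoint and KeepValidXmlChars are made up for illustration and are not part of the original snippet. It assumes the XML 1.0 Char production as the definition of "valid", so unlike the regex above it keeps U+007F-U+009F and U+FEFF.

```csharp
using System;
using System.Text;

public static class XmlCharCheck
{
    // XML 1.0 Char production, expressed over code points rather than UTF-16 units.
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Keeps valid code points and well-formed surrogate pairs; drops everything else.
    public static string KeepValidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed pair always decodes to U+10000..U+10FFFF, which is valid.
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate we just consumed
            }
            else if (!char.IsSurrogate(c) && IsValidXmlCodePoint(c))
            {
                sb.Append(c);
            }
            // Anything else (lone surrogate, control character, U+FFFE/U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

Something like XmlCharCheck.KeepValidXmlChars(input) could then be compared against RemoveInvalidXMLChars(input) on sample data to see where the stricter regex and the spec-only check disagree.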
