How do I remove emoji characters from a string?

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2
    }
}

Here “Cs” is the Unicode category for “surrogate”.

This works because .NET’s Regex operates on UTF-16 code units rather than Unicode code points, so each half of a surrogate pair is matched on its own. If it matched whole code points instead, you’d need a different approach.
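To make the code-unit point concrete, here is a minimal sketch that inspects the two halves of the surrogate pair and then does the same filtering without Regex. The class name and the RemoveSurrogates helper are made up for illustration; only char.IsSurrogate and char.GetUnicodeCategory come from the framework.

```csharp
using System;
using System.Text;

class SurrogateDemo
{
    // Hypothetical helper: copies every UTF-16 code unit that is not a surrogate.
    public static string RemoveSurrogates(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        string text = "x\U0001F310y";
        // Indexes 1 and 2 hold the high and low surrogate of U+1F310.
        Console.WriteLine(char.GetUnicodeCategory(text[1])); // Surrogate
        Console.WriteLine(char.GetUnicodeCategory(text[2])); // Surrogate
        Console.WriteLine(RemoveSurrogates(text)); // xy
    }
}
```

For this input it should give the same result as the \p{Cs} regex; which of the two you use is mostly a matter of taste.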

Note that there are non-BMP characters other than emoji, but I suspect you’ll find they’ll have the same problem when you try to store them.

Additionally, note that this won’t remove emoji in the BMP, such as U+2764 (red heart). You can use the above as an example of how to remove characters in specific Unicode categories: the category for U+2764 is “other symbol” (So), for example. Whether you want to remove all “other symbols” is a different matter.
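As a rough sketch of that category-based variant, the following removes both surrogates and “other symbol” characters in one pass. The class and helper names are invented for this example, and treating all of So as unwanted is an assumption you would need to check against your own data.

```csharp
using System;
using System.Text.RegularExpressions;

class SymbolDemo
{
    // Hypothetical helper: strips surrogates (Cs) and "other symbol" (So) characters.
    public static string RemoveSurrogatesAndOtherSymbols(string text)
    {
        return Regex.Replace(text, @"[\p{Cs}\p{So}]", "");
    }

    static void Main()
    {
        // U+2764 is in the BMP (category So); U+1F310 is stored as a surrogate pair.
        string text = "a\u2764b\U0001F310c";
        Console.WriteLine(RemoveSurrogatesAndOtherSymbols(text)); // abc
    }
}
```

Be aware that So also covers characters such as © and °, so this pattern removes more than emoji.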

But if you’re really interested in just removing surrogate pairs because they can’t be stored properly, the above should be fine.
