Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:
using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        // "x", then U+1F310 (stored as a surrogate pair), then "y"
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4

        // \p{Cs} matches any single surrogate code unit
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2
    }
}
Here “Cs” is the Unicode category for “surrogate”.
This works because Regex in .NET operates on UTF-16 code units rather than Unicode code points; if it matched whole code points instead, you’d need a different approach.
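To make the code-unit reasoning concrete, here is a minimal regex-free sketch that applies the same filter by hand. The class and method names (`SurrogateStripper`, `RemoveSurrogates`) are made up for illustration; only `char.IsSurrogate` and `StringBuilder` come from the framework.

```csharp
using System;
using System.Text;

static class SurrogateStripper
{
    // Hypothetical helper: copies every UTF-16 code unit that is not
    // a surrogate, which drops all non-BMP characters.
    public static string RemoveSurrogates(string text)
    {
        var builder = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!char.IsSurrogate(c))
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

Calling `SurrogateStripper.RemoveSurrogates("x\U0001F310y")` should give the same two-character string as the regex version.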
Note that there are non-BMP characters other than emoji, but I suspect you’ll find they’ll have the same problem when you try to store them.
Additionally, note that this won’t remove emoji in the BMP, such as U+2764 (red heart). You can use the above as an example of how to remove characters in specific Unicode categories: the category for U+2764 is “other symbol” (So), for example. Whether you want to remove all “other symbols” is a different matter.
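As a rough sketch of the category-based variant, the same `Regex.Replace` call can target “other symbol” via `\p{So}`. The names `SymbolStripper` and `RemoveOtherSymbols` are hypothetical, and whether this category is the right one to strip depends on your data.

```csharp
using System;
using System.Text.RegularExpressions;

static class SymbolStripper
{
    // Hypothetical helper: removes every character whose Unicode
    // category is "other symbol" (So), e.g. U+2764.
    public static string RemoveOtherSymbols(string text)
    {
        return Regex.Replace(text, @"\p{So}", "");
    }
}
```

With this, `SymbolStripper.RemoveOtherSymbols("a\u2764b")` gives `"ab"`, but `"\u00A9 2020"` also loses its copyright sign, since U+00A9 is an “other symbol” too. That is exactly the over-removal risk mentioned above.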
But if you’re really interested in just removing surrogate pairs because they can’t be stored properly, the above should be fine.