How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3 byte (or less) encodings in UTF8. The \uD800-\uDFFF range is for multibyte UTF16. I do not know python, but you should be able to set up a regular expression to match outside those ranges.

pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)

Edit adding Python from Denilson Sá’s script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)    

Leave a Comment