How can I convert surrogate pairs to a normal string in Python?

You've mixed up two different things: the literal text \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) and the single character '\ud83d' in memory (written as a string literal in Python source code). It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.
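As a small illustration of that difference (the variable names here are only for the example), the two lengths can be checked directly:

```python
# Raw string: the backslash is kept as-is, so this is six separate characters.
escaped = r'\ud83d'

# Ordinary string literal: Python interprets the escape, giving one code point
# (a lone high surrogate, U+D83D).
surrogate = '\ud83d'

print(len(escaped))    # 6
print(len(surrogate))  # 1
```

The example prints only the lengths, because printing the lone surrogate itself may fail on a UTF-8 terminal.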

If you see the Python string '\ud83d\ude4f' (2 characters), there is a bug upstream. Normally you shouldn't get such a string. If you do get one and you can't fix the upstream code that generates it, you can repair it with the surrogatepass error handler:

>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'🙏'

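Python 2 was more permissive: its UTF-8 codec accepted surrogate code points without complaint, whereas Python 3 rejects them unless you pass an error handler such as surrogatepass.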
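If you need the surrogatepass round trip in more than one place, it could be wrapped in a small helper. This is only a sketch; the name fix_surrogates is made up for this example and is not a standard library function:

```python
def fix_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs in `text` into the characters they encode.

    Hypothetical helper, not part of the standard library.
    """
    # 'surrogatepass' lets the encoder write surrogate code points out as raw
    # UTF-16 code units; decoding those bytes then joins each high/low pair
    # into a single code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


fixed = fix_surrogates('\ud83d\ude4f')
print(len(fixed))    # 1
print(ascii(fixed))  # '\U0001f64f'
```

Strings without surrogates pass through unchanged. A lone surrogate that has no partner still raises UnicodeDecodeError at the decode step, since there is no character it could be converted to.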

Note: even if your JSON file contains the literal text \ud83d\ude4f (12 characters), you shouldn't get the surrogate pair after parsing it:

>>> import json
>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'

Notice: the result is 1 character ('\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').
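For context, here is a sketch of where those 12 characters typically come from. With its default ensure_ascii=True, json.dumps writes a non-BMP character as a pair of \uXXXX escapes, and json.loads merges them back into one character (the variable names are only for illustration):

```python
import json

# Serialising U+1F64F with the default ensure_ascii=True produces two
# \uXXXX escapes (a surrogate pair written as plain ASCII text).
encoded = json.dumps('\U0001f64f')
print(encoded)       # "\ud83d\ude4f"
print(len(encoded))  # 14: the 12 escape characters plus two double quotes

# Parsing that text yields the single original character again.
decoded = json.loads(encoded)
print(len(decoded))  # 1
```

So a correct JSON round trip never leaves you with a 2-character surrogate pair in memory.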
