Identifier normalization: Why is the micro sign converted into the Greek letter mu?

There are two different characters involved here. One is the MICRO SIGN (U+00B5), which is the one on the keyboard, and the other is GREEK SMALL LETTER MU (U+03BC).

To understand what’s going on, we should take a look at how Python defines identifiers in the language reference:

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">

Both our characters, MICRO SIGN and GREEK SMALL LETTER MU, belong to the Unicode general category Ll (lowercase letters), so both of them can be used at any position in an identifier. Now note that the definition of identifier actually refers to xid_start and xid_continue, and those are defined as all characters in the respective non-x definition whose NFKC normalization results in a valid character sequence for an identifier.
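As a quick check of the claim above, the following sketch looks up the name and general category of both characters with the standard unicodedata module and asks str.isidentifier() whether each one is a valid identifier on its own. The constant names MICRO and MU are only illustrative.

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN
MU = "\u03bc"     # GREEK SMALL LETTER MU

# Collect (name, general category, valid identifier?) for each character.
info = [
    (unicodedata.name(ch), unicodedata.category(ch), ch.isidentifier())
    for ch in (MICRO, MU)
]

for name, category, is_identifier in info:
    print(name, category, is_identifier)
```

Both characters should report the category Ll and be accepted as identifiers, even though they are distinct code points.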

Python apparently only cares about the normalized form of identifiers. This is confirmed a bit below:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
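A minimal sketch of that parse-time normalization, assuming CPython 3: the source text passed to exec() spells the name with the MICRO SIGN, yet the key that ends up in the namespace dictionary is the GREEK SMALL LETTER MU. The variable names here are illustrative only.

```python
MICRO = "\u00b5"  # MICRO SIGN
MU = "\u03bc"     # GREEK SMALL LETTER MU

namespace = {}
# The source text assigns to a name spelled with the MICRO SIGN.
exec(MICRO + " = 1", namespace)

# exec() also adds '__builtins__', so filter it out before inspecting.
names = [key for key in namespace if key != "__builtins__"]
print([hex(ord(ch)) for ch in names[0]])  # → ['0x3bc']
```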

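NFKC is a Unicode normalization form that first applies compatibility decomposition, replacing characters with their simpler equivalents, and then recomposes the result canonically. The MICRO SIGN has a compatibility decomposition to GREEK SMALL LETTER MU, and that’s exactly what’s going on there.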
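The mapping can be inspected directly with unicodedata. This sketch compares NFKC with NFC (which does not apply compatibility mappings) and prints the decomposition recorded for the MICRO SIGN in the Unicode database; the variable names are illustrative.

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN

nfkc = unicodedata.normalize("NFKC", MICRO)
nfc = unicodedata.normalize("NFC", MICRO)

print(unicodedata.name(nfkc))              # → GREEK SMALL LETTER MU
print(unicodedata.name(nfc))               # → MICRO SIGN
print(unicodedata.decomposition(MICRO))    # → <compat> 03BC
```

The `<compat>` tag marks this as a compatibility mapping, which is why only the "K" forms (NFKC, NFKD) change the character.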

There are a lot of other characters that are also affected by this normalization. One other example is OHM SIGN (U+2126), which normalizes to GREEK CAPITAL LETTER OMEGA (U+03A9). Using that as an identifier gives a similar result, here shown using locals:

>>> Ω = 'bar'
>>> locals()['Ω']
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    locals()['Ω']
KeyError: 'Ω'
>>> [k for k, v in locals().items() if v == 'bar'][0].encode()
b'\xce\xa9'
>>> 'Ω'.encode()
b'\xe2\x84\xa6'

So in the end, this is just something that Python does. Unfortunately, the normalization happens silently, without any warning, which leads to errors such as the one shown. Usually, when the identifier is only referred to as an identifier, i.e. it’s used like a real variable or attribute, everything will be fine: the normalization runs every time, and the identifier is found.

The only problem is with string-based access. Strings are just strings, so of course no normalization happens to them (that would be a bad idea). The two string-based approaches discussed here, getattr and locals, both end up doing dictionary lookups: getattr() looks the attribute up by name, typically in the object’s __dict__, and locals() returns a dictionary. In dictionaries, keys can be any string, so it’s perfectly fine to have a MICRO SIGN or an OHM SIGN in there.
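The mismatch also works in the other direction, as this sketch suggests: a name stored through setattr() keeps its MICRO SIGN, so attribute access written as source code (which the parser normalizes) can no longer find it. types.SimpleNamespace is used here only as a convenient object with a __dict__; the names obj, MICRO and found are illustrative.

```python
import types

MICRO = "\u00b5"  # MICRO SIGN

obj = types.SimpleNamespace()
# setattr() takes a plain string, so the MICRO SIGN is stored unchanged.
setattr(obj, MICRO, 1)
stored_keys = list(vars(obj))

# Attribute access written as source code goes through the parser,
# which normalizes the name to GREEK SMALL LETTER MU before the lookup.
try:
    eval("obj." + MICRO, {"obj": obj})
    found = True
except AttributeError:
    found = False

print(stored_keys == [MICRO], found)  # → True False
```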

In those cases, you need to remember to perform the normalization yourself. We can use unicodedata.normalize for this, which then also allows us to correctly get our value from inside locals() (or using getattr):

>>> import unicodedata
>>> normalized_ohm = unicodedata.normalize('NFKC', 'Ω')
>>> locals()[normalized_ohm]
'bar'
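If string-based lookups happen in several places, the normalization can be wrapped once. The helpers below, normalize_identifier and getattr_normalized, are hypothetical names and not part of the standard library; this is only a sketch that assumes the name is otherwise a valid identifier.

```python
import unicodedata


def normalize_identifier(name):
    """Return the NFKC form of name, matching what the parser stores."""
    return unicodedata.normalize("NFKC", name)


def getattr_normalized(obj, name, *default):
    """Like getattr(), but normalizes the attribute name first."""
    return getattr(obj, normalize_identifier(name), *default)


OHM = "\u2126"  # OHM SIGN

# Build a class whose attribute is spelled with the OHM SIGN in source code.
namespace = {}
exec("class Units:\n    " + OHM + " = 'bar'\n", namespace)
Units = namespace["Units"]

value = getattr_normalized(Units, OHM)
fallback = getattr_normalized(Units, "missing", None)
print(value, fallback)  # → bar None
```

A plain getattr(Units, OHM) would raise AttributeError here, because the class dictionary only contains the normalized name.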
