TL;DR
- Converting to lowercase -> lower()
- Caseless string matching/comparison -> casefold()
casefold() is a text normalization function like lower(), but it is specifically designed to remove upper- and lower-case distinctions for the purposes of comparison. At first glance it looks very similar to lower() because the results are usually the same: as of Unicode 13.0.0, only ~300 of ~150,000 characters produced differing results when passed through lower() and casefold(). @dlukes’ answer has the code to identify the characters that generate those differing results.
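To make the difference concrete, here is a minimal sketch (not the code from @dlukes’ answer) that collects every character whose lower() and casefold() results differ. The helper name differing_characters is made up for this illustration, and the exact count depends on the Unicode version bundled with your Python build, so no count is shown here.

```python
import sys
import unicodedata


def differing_characters():
    """Return the characters whose lower() and casefold() results differ."""
    return [
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        if ch.lower() != ch.casefold()
    ]


differing = differing_characters()
print(f"Unicode version used by this Python: {unicodedata.unidata_version}")
print(f"characters where lower() != casefold(): {len(differing)}")

# A well-known example: the German sharp s.
print("ß".lower())     # ß
print("ß".casefold())  # ss
```

The sharp s shows why the distinction matters: lower() leaves it alone because it is already lowercase, while casefold() expands it to "ss" so that it can match "SS" in a caseless comparison.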
To answer your other two questions:
- Use lower() when you specifically want to ensure a character is lowercase, for example when presenting text to users or persisting data.
- Use casefold() when you want to compare the result to another casefold-ed value.
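As a rough illustration of that split, the sketch below uses casefold() on both sides of a comparison and lower() only for a value meant to be shown or stored. The function name caseless_equal and the sample strings are assumptions made up for this example.

```python
def caseless_equal(left: str, right: str) -> bool:
    """Compare two strings while ignoring case distinctions."""
    # Fold BOTH sides; comparing a folded value to an unfolded one defeats the purpose.
    return left.casefold() == right.casefold()


# Comparison: casefold() matches where lower() does not.
print(caseless_equal("Straße", "STRASSE"))    # True
print("Straße".lower() == "STRASSE".lower())  # False

# Presentation or persistence: lower() keeps the text recognizable.
display_name = "ALICE".lower()
print(display_name)  # alice
```

Note that this only handles case. Strings that differ in Unicode normalization form (composed versus decomposed accents, for instance) would also need unicodedata.normalize before folding, which is what the related SO answer linked below covers.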
Other Material
I suggest you take a closer look into what case folding actually is, so here’s a good start: W3 Case Folding Wiki
Another source: Elastic.co Case Folding
Edit: I just recently found another very good related answer to a slightly different question here on SO (doing a case-insensitive string comparison)
Performance
Using this snippet, you can get a sense of how the two compare in performance:

    import sys
    from timeit import timeit

    # Every code point as a one-character string
    # (range() stops just short of sys.maxunicode itself).
    unicode_codepoints = tuple(map(chr, range(sys.maxunicode)))

    def compute_lower():
        return tuple(codepoint.lower() for codepoint in unicode_codepoints)

    def compute_casefold():
        return tuple(codepoint.casefold() for codepoint in unicode_codepoints)

    # Number of full passes to average over; lower this for a quicker run.
    timer_repeat = 1000

    print(f"time to compute lower on unicode namespace: {timeit(compute_lower, number=timer_repeat) / timer_repeat} seconds")
    print(f"time to compute casefold on unicode namespace: {timeit(compute_casefold, number=timer_repeat) / timer_repeat} seconds")
    print(f"number of distinct characters from lower: {len(set(compute_lower()))}")
    print(f"number of distinct characters from casefold: {len(set(compute_casefold()))}")

Running this shows that the two are overwhelmingly the same in both performance and the number of distinct characters returned:

    time to compute lower on unicode namespace: 0.137255663 seconds
    time to compute casefold on unicode namespace: 0.136321374 seconds
    number of distinct characters from lower: 1112719
    number of distinct characters from casefold: 1112694

If you run the numbers (about 0.137 seconds divided by the roughly 1.1 million code points processed per pass), that comes to about 1.2e-07 seconds per character for either function, so there isn’t a performance benefit either way.