lower() vs. casefold() in string matching and converting to lowercase

TL;DR

  • Converting to Lowercase -> lower()
  • Caseless String matching/comparison -> casefold()

casefold() is a text normalization function like lower(), but it is designed specifically to remove upper- and lower-case distinctions for the purposes of comparison. It can look almost identical to lower() because the results are usually the same: as of Unicode 13.0.0, only ~300 of ~150,000 characters produce different results when passed through lower() and casefold(). @dlukes’ answer has code to identify those characters.
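
For a rough idea of how such characters can be found, here is a minimal sketch. It is not @dlukes’ code, just one straightforward approach, and the helper name differing_code_points is made up for illustration. The exact count it prints depends on the Unicode version bundled with your Python build.

```python
import sys

def differing_code_points():
    """Return the characters whose lower() and casefold() results differ."""
    return [
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        if ch.lower() != ch.casefold()
    ]

differing = differing_code_points()
print(f"characters where lower() != casefold(): {len(differing)}")

# Show a few examples; !a escapes non-ASCII so this prints on any terminal.
for ch in differing[:5]:
    print(f"U+{ord(ch):04X}: lower={ch.lower()!a}, casefold={ch.casefold()!a}")
```

The German "ß" is a typical member of this list: lower() leaves it unchanged, while casefold() expands it to "ss".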

To answer your other two questions:

  • use lower() when you specifically want to ensure a character is lowercase, like for presenting to users or persisting data
  • use casefold() when you want to compare the result to another casefold-ed value, i.e. apply casefold() to both sides of the comparison.
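
A small sketch of the difference in practice, using the German "ß" as the example. The helper name caseless_equal is hypothetical, chosen only for this illustration:

```python
def caseless_equal(left: str, right: str) -> bool:
    """Compare two strings with case distinctions removed on both sides."""
    return left.casefold() == right.casefold()

# lower() keeps "ß" as-is, so it is the right choice for display or storage.
print("Straße".lower())      # straße

# casefold() expands "ß" to "ss", which is what makes caseless matching work.
print("Straße".casefold())   # strasse

print("Straße".lower() == "STRASSE".lower())   # False
print(caseless_equal("Straße", "STRASSE"))     # True
```

Note that the casefold-ed form ("strasse") is not something you would normally show to a user or persist; it is only an intermediate value for the comparison.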

Other Material

I suggest you take a closer look into what case folding actually is, so here’s a good start: W3 Case Folding Wiki

Another source:
Elastic.co Case Folding

Edit: I just recently found another very good related answer to a slightly different question here on SO (doing a case-insensitive string comparison)


Performance

Using this snippet, you can get a sense for the performance between the two:

import sys
from timeit import timeit

# One single-character string per code point.
# Note: range(sys.maxunicode) stops just short of the final code point, U+10FFFF.
unicode_codepoints = tuple(map(chr, range(sys.maxunicode)))

def compute_lower():
    # Apply lower() to every code point.
    return tuple(codepoint.lower() for codepoint in unicode_codepoints)

def compute_casefold():
    # Apply casefold() to every code point.
    return tuple(codepoint.casefold() for codepoint in unicode_codepoints)

# Number of full passes to average the timing over.
timer_repeat = 1000

print(f"time to compute lower on unicode namespace: {timeit(compute_lower, number=timer_repeat) / timer_repeat} seconds")
print(f"time to compute casefold on unicode namespace: {timeit(compute_casefold, number=timer_repeat) / timer_repeat} seconds")

print(f"number of distinct characters from lower: {len(set(compute_lower()))}")
print(f"number of distinct characters from casefold: {len(set(compute_casefold()))}")

Running this shows that the two are nearly identical in both performance and the number of distinct results returned:

time to compute lower on unicode namespace: 0.137255663 seconds
time to compute casefold on unicode namespace: 0.136321374 seconds
number of distinct characters from lower: 1112719
number of distinct characters from casefold: 1112694

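If you run the numbers (about 0.137 seconds per pass divided by the 1,114,111 code points processed), that works out to roughly 1.2e-07 seconds per character for either function, including the overhead of the generator and tuple construction, so there isn’t a meaningful performance benefit either way.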
