In what JS engines, specifically, are toLowerCase & toUpperCase locale-sensitive?

Note: I couldn’t test any of this myself.

As per the ECMAScript specification:

String.prototype.toLowerCase ( )

[…]

For the purposes of this operation, the 16-bit code units of the
Strings are treated as code points in the Unicode Basic Multilingual
Plane. Surrogate code points are directly transferred from S to L
without any mapping.

The result must be derived according to the case mappings in the
Unicode character database (this explicitly includes not only the
UnicodeData.txt file, but also the SpecialCasings.txt file that
accompanies it in Unicode 2.1.8 and later).

[…]

String.prototype.toLocaleLowerCase ( )

This function works exactly the same as toLowerCase except that its
result is intended to yield the correct result for the host
environment’s current locale, rather than a locale-independent result.
There will only be a difference in the few cases (such as Turkish)
where the rules for that language conflict with the regular Unicode
case mappings.

[…]
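To make the locale-independent part concrete, here is a minimal sketch that prints the code units produced by toLowerCase() and toUpperCase() for a few inputs. The codeUnits helper is my own name, not part of any API, and the expected values assume an engine that applies the unconditional SpecialCasing mappings, as the spec text above requires.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as U+XXXX.
function codeUnits(str) {
  return Array.from({ length: str.length }, (_, i) =>
    'U+' + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

// Plain ASCII: the regular UnicodeData.txt mapping.
console.log(codeUnits('I'.toLowerCase()));      // U+0069

// Unconditional SpecialCasing entry for U+0130: lowercases to i + COMBINING DOT ABOVE.
console.log(codeUnits('\u0130'.toLowerCase())); // U+0069 U+0307

// Another unconditional SpecialCasing entry: sharp s uppercases to "SS".
console.log(codeUnits('\u00DF'.toUpperCase())); // U+0053 U+0053
```

None of these results should depend on the OS or browser locale, because only the unconditional mappings are involved.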

And as per Unicode Character Database Special Casing:

[…]

Format

The entries in this file are in the following machine-readable format:

<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
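As a reading aid for the entries quoted in this answer, here is a rough sketch of a parser for a single line in that format. The function name parseSpecialCasingLine and the shape of the returned object are my own assumptions; the file itself only defines the field order.

```javascript
// Hypothetical parser for one SpecialCasing.txt data line:
// <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
function parseSpecialCasingLine(line) {
  const hashIndex = line.indexOf('#');
  const data = hashIndex === -1 ? line : line.slice(0, hashIndex);
  const comment = hashIndex === -1 ? '' : line.slice(hashIndex + 1).trim();

  // Every field ends with ';', so the piece after the last ';' is empty.
  const fields = data.split(';').map((field) => field.trim());
  if (fields.length < 5) return null; // blank or comment-only line

  // Turn a space-separated list of hex code points into a string.
  const decode = (hexList) =>
    hexList === ''
      ? ''
      : String.fromCodePoint(...hexList.split(/\s+/).map((hex) => parseInt(hex, 16)));

  return {
    code: decode(fields[0]),
    lower: decode(fields[1]),
    title: decode(fields[2]),
    upper: decode(fields[3]),
    // The condition list is present only when there is a sixth piece.
    conditions: fields.length > 5 && fields[4] !== '' ? fields[4].split(/\s+/) : [],
    comment,
  };
}

const entry = parseSpecialCasingLine(
  '0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I'
);
console.log(entry.conditions); // [ 'tr', 'Not_Before_Dot' ]
```

An entry with an empty conditions array is an unconditional mapping; one whose first condition is a language tag such as lt, tr or az only applies under that language.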

Unconditional mappings

[…]

Preserve canonical equivalence for I with dot. Turkic is handled
below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

[…]

Language-Sensitive Mappings
These are characters whose full case mappings depend on language and perhaps also
context (which characters come before or after). For more information
see the header of this file and the Unicode Standard.

Lithuanian

Lithuanian retains the dot in a lowercase i when followed by accents.

Remove DOT ABOVE after “i” with upper or titlecase

0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

Introduce an explicit dot above when lowercasing capital I’s and J’s
whenever there are more accents above.
(of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I

004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J

012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK

00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

Turkish and Azeri

I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE

0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE

0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I

0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I

0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

Note: the following case is already in the UnicodeData.txt file.

0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I

EOF
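To illustrate what the Turkish and Azeri lowercasing rules above amount to, here is a simplified sketch written by hand. It is not how any engine implements them: turkicLowerSketch is a hypothetical name, and the After_I and Not_Before_Dot conditions are approximated by looking only at the immediately following character, whereas the Unicode Standard also allows certain intervening combining marks.

```javascript
// Simplified, hypothetical sketch of the tr/az lowercasing rules.
// Assumption: a COMBINING DOT ABOVE counts only if it directly follows the I.
function turkicLowerSketch(input) {
  const chars = Array.from(input); // iterate by code point
  let out = '';
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (ch === '\u0130') {
      // 0130 -> 0069: dotted capital I lowercases to plain i.
      out += 'i';
    } else if (ch === 'I') {
      if (chars[i + 1] === '\u0307') {
        // I + dot_above -> i, and the dot is removed (After_I rule).
        out += 'i';
        i++;
      } else {
        // Not_Before_Dot: capital I lowercases to dotless i.
        out += '\u0131';
      }
    } else {
      // Everything else uses the locale-independent mapping.
      out += ch.toLowerCase();
    }
  }
  return out;
}

console.log(turkicLowerSketch('ISTANBUL'));       // ıstanbul
console.log(turkicLowerSketch('\u0130STANBUL'));  // istanbul
```

Compare this with 'ISTANBUL'.toLowerCase(), which gives "istanbul" because the tr and az entries are conditional and the locale-independent function does not apply them.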

Also, as per JavaScript for Absolute Beginners (by Terry McNavage):

> "I".toLowerCase() // "i"
> "i".toUpperCase() // "I"
> "I".toLocaleLowerCase() // "<dotless-i>"
> "i".toLocaleUpperCase() // "<dotted-I>"
Note: toLocaleLowerCase() and toLocaleUpperCase() convert case based on your OS settings. You’d have to change those settings to Turkish for the previous sample to work. Or just take my word for it!
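For anyone who wants to try this without changing OS settings, here is a hedged sketch. It assumes an engine that implements the ECMAScript Internationalization API (ECMA-402), where toLocaleLowerCase() and toLocaleUpperCase() accept an optional locale tag; engines without it may ignore the argument and use the host locale instead. The compareCasing function is just an illustrative name.

```javascript
// Compare locale-independent and locale-tailored case conversion.
// Assumption: the locale argument is honoured only where ECMA-402 is supported.
function compareCasing(locale) {
  return {
    lower: 'I'.toLowerCase(),                    // always "i"
    upper: 'i'.toUpperCase(),                    // always "I"
    localeLower: 'I'.toLocaleLowerCase(locale),  // U+0131 if Turkish tailoring is applied
    localeUpper: 'i'.toLocaleUpperCase(locale),  // U+0130 if Turkish tailoring is applied
  };
}

console.log(compareCasing('tr'));
console.log(compareCasing('en'));
```

If the two printed objects differ only in localeLower and localeUpper, that matches the book’s example: the locale-aware methods pick up the Turkish rules while toLowerCase() and toUpperCase() do not.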

And as per bobince’s comment on the question “Convert JavaScript String to be all lower case?”:

Accept-Language and navigator.language are two completely separate
settings. Accept-Language reflects the user’s chosen preferences for
what languages they want to receive in web pages (and this setting is
unfortunately inaccessible to JS). navigator.language merely reflects
which localisation of the web browser was installed, and should
generally not be used for anything. Both of these values are unrelated
to the system locale, which is the bit that decides what
toLocaleLowerCase() will do; that’s an OS-level setting out of scope
of the browser’s prefs.

So, setting lang="tr-TR" on the html element won’t give a realistic test case, since reproducing the special casing example requires an OS-level setting.

I think only lowercasing dotted-I or uppercasing dotless-i could be locale-specific when using toLowerCase() or toUpperCase().

As per those credible/official sources, I think you’re right: 'i' !== 'I'.toLowerCase() would always evaluate to false.

But, as I said, I couldn’t test it here.
