UTF-8 or UTF-16 or UTF-32 or UCS-2

So I am now confused about which to use:
UTF-8 / UTF-16 / UTF-32 / UCS-2

Which is better for multilingual
content, performance, etc.?

UCS-2 is obsolete: It can no longer represent every Unicode character. UTF-8, UTF-16, and UTF-32 all can. But why have three different ways to encode the same characters?

Because in the old days, programmers made two big assumptions about strings.

  1. That strings consist of 8-bit code units.
  2. That 1 character = 1 code unit.

The problem for multilingual text (or even for monolingual text if that language happened to be Chinese, Japanese, or Korean) is that these two assumptions combined limit you to 256 characters. If you need to represent more than that, you need to drop one of the assumptions.

Keeping assumption #1 and dropping assumption #2 gives you a variable-width (or multi-byte) encoding. Today, the most popular variable-width encoding is UTF-8.

Dropping assumption #1 and keeping assumption #2 gives you a wide-character encoding. Unicode and UCS-2 were originally designed to use a 16-bit fixed-width encoding, which would allow for 65,536 characters. Early adopters of Unicode, such as Sun (for Java) and Microsoft (for NT), used UCS-2.

However, a few years later, it was realized that even that wasn’t enough for everybody, so the Unicode code range was expanded. Now if you want a fixed-width encoding, you have to use UTF-32.

But Sun and Microsoft had written huge APIs based around 16-bit characters, and weren’t enthusiastic about rewriting them for 32-bit. Fortunately, there was still a block of 2048 unassigned characters out of the original 65,536-character “Basic Multilingual Plane”, which could be assigned as “surrogates” to be used in pairs to represent supplementary characters: the UTF-16 encoding form. Unfortunately, UTF-16 meets neither of the original two assumptions: It’s both non-8-bit and variable-width.
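As a rough illustration of how a surrogate pair is formed, here is a minimal Python sketch. The function name to_surrogate_pair is made up for this example; the arithmetic follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), and the result is cross-checked against Python's built-in encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into
    a (high, low) pair of UTF-16 surrogate code units.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)  # U+1F600 is outside the BMP

# Cross-check against the built-in UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
built_in = (int.from_bytes(encoded[0:2], "big"),
            int.from_bytes(encoded[2:4], "big"))

print(hex(high), hex(low), (high, low) == built_in)
```

Both surrogates fall in the reserved 2048-value block (0xD800 to 0xDFFF), which is why a decoder can always tell them apart from ordinary BMP characters.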

In summary:

Use UTF-8 when the assumption of 8-bit code units is important.

This applies to:

  • Filenames and related OS calls on Unix systems, which had an established tradition of allowing variable-width encodings, but can’t accept '\x00' bytes within strings and thus can’t use UTF-16 or UTF-32. In fact, UTF-8 was originally designed for a Unix-based OS (Plan 9).
  • Communications protocols designed around streams of octets.
  • Anything that requires binary compatibility with US-ASCII but gives no special treatment to byte values above 127.
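The points above can be checked with a short Python sketch. The sample strings are arbitrary, and the "-le" codec names are used only to avoid a byte order mark in the output. It shows that UTF-8 output for ASCII text is byte-identical to US-ASCII, that UTF-16 and UTF-32 embed zero bytes even in plain ASCII text, and that every byte UTF-8 produces for a non-ASCII character is above 127.

```python
ascii_text = "hello"

utf8 = ascii_text.encode("utf-8")
utf16 = ascii_text.encode("utf-16-le")
utf32 = ascii_text.encode("utf-32-le")

# ASCII text encoded as UTF-8 is byte-for-byte identical to US-ASCII.
same_as_ascii = utf8 == ascii_text.encode("ascii")

# Which encodings put 0x00 bytes into the output for plain ASCII text?
contains_nul = {
    "utf-8": b"\x00" in utf8,
    "utf-16": b"\x00" in utf16,
    "utf-32": b"\x00" in utf32,
}

# Every byte of a non-ASCII character in UTF-8 has the high bit set,
# so it can never be mistaken for an ASCII byte such as '/' or NUL.
non_ascii_bytes = "é日".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii_bytes)

print(same_as_ascii, contains_nul, all_high)
```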

Use UTF-32 when the assumption of a fixed-width encoding is important.

This is useful when you care about the properties of characters as opposed to their encoding, such as the Unicode equivalents of the ctype.h functions isalpha, isdigit, toupper, etc.
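To make the fixed-width property concrete, here is a small Python sketch. The helper code_point_at is hypothetical, and little-endian UTF-32 without a byte order mark is an assumption of the example. Because every code point occupies exactly 4 bytes, the n-th code point sits at byte offset 4 * n and can be read without scanning the preceding text.

```python
def code_point_at(utf32_le, index):
    """Return the code point at position `index` in a little-endian
    UTF-32 buffer. Hypothetical helper for illustration only."""
    start = index * 4                   # fixed width: 4 bytes per code point
    return int.from_bytes(utf32_le[start:start + 4], "little")


text = "a\u00df\u65e5\U0001F600"        # 'a', 'ß', '日', and an emoji
data = text.encode("utf-32-le")

# Four code points, four bytes each, regardless of script.
assert len(data) == 4 * len(text)

# Direct indexing, then a character-property query on the result.
third = chr(code_point_at(data, 2))
print(hex(code_point_at(data, 3)), third, third.isalpha())
```

The same lookup in UTF-8 or UTF-16 would require walking the buffer from the start, since earlier characters may take a different number of code units.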

Use UTF-16 when neither assumption is that important, but your platform used to use UCS-2.

Are you writing for Windows, or for the .NET framework designed for it? For Java? Then UTF-16 is your default string type; might as well use it.

Since you are using C#, all of your strings will be encoded in UTF-16. ASP.NET will encode the actual HTML pages in UTF-8, but this is done behind the scenes and you don’t need to care.

Size considerations

The three UTF encoding forms require different amounts of memory to represent a character:

  • Characters U+0000 to U+007F (ASCII) require 1 byte in UTF-8, 2 bytes in UTF-16, or 4 bytes in UTF-32.
  • Characters U+0080 to U+07FF (IPA symbols, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, NKo) require 2 bytes in UTF-8, 2 bytes in UTF-16, or 4 bytes in UTF-32.
  • Characters U+0800 to U+FFFF (the rest of the BMP, mostly for Asian languages) require 3 bytes in UTF-8, 2 bytes in UTF-16, or 4 bytes in UTF-32.
  • Characters U+10000 to U+10FFFF require 4 bytes in all three encoding forms.

Thus, if you want to save space, use UTF-8 if your characters are mostly ASCII, or UTF-16 if your characters are mostly in the U+0800 to U+FFFF range (for example, Chinese, Japanese, or Korean text).
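The sizes listed above can be verified with a few lines of Python. The function name encoded_sizes and the sample characters are chosen for this example only, and the "-le" codec variants are used so that no byte order mark inflates the counts.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in each UTF encoding form.
    Hypothetical helper for illustration only."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = {
    "U+0041 (ASCII)": "A",
    "U+0416 (Cyrillic)": "\u0416",
    "U+65E5 (CJK)": "\u65e5",
    "U+1F600 (supplementary)": "\U0001F600",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Running the same function over a representative sample of your own content is a simple way to see which encoding is actually smaller for your data.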
