C programming: How to program for Unicode?

C99 or earlier

The C standard (C99) provides for wide characters and multi-byte characters, but since there is no guarantee about what those wide characters can hold, their value is somewhat limited. For a given implementation, they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.

Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU – International Components for Unicode – library) is sound, IMO.

The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.

Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.

One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format – see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are multi-byte formats.

UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of characters in a string reliably, starting at any point in the string. Single-byte characters have the high bit set to zero. Multi-byte characters have a first byte starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), with subsequent bytes always starting 10. The continuation bytes are therefore always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the minimum possible format. One consequence of these rules, together with the upper limit of U+10FFFF, is that the bytes 0xC0 and 0xC1 (also 0xF5..0xFF) cannot appear in valid UTF-8 data.

 U+0000 ..   U+007F  1 byte   0xxx xxxx
 U+0080 ..   U+07FF  2 bytes  110x xxxx   10xx xxxx
 U+0800 ..   U+FFFF  3 bytes  1110 xxxx   10xx xxxx   10xx xxxx
U+10000 .. U+10FFFF  4 bytes  1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx
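
As an illustration of the table above, here is a minimal sketch of an encoder, plus a helper that backs up to the start of a character. The function names utf8_encode and utf8_char_start are made up for this example and are not part of any standard library; the helper assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out
   (room for 4 bytes). Returns the number of bytes written, or 0 if
   cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxx xxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110x xxxx  10xx xxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110 xxxx + 2 continuation bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 1111 0xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}

/* Hypothetical helper: given any byte offset pos in s, step back over
   continuation bytes (10xx xxxx) to the first byte of that character.
   Assumes s holds valid UTF-8. */
size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

For example, utf8_encode(0x20AC, buf) stores the three bytes 0xE2 0x82 0xAC for the euro sign.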

Originally, it was hoped that Unicode would be a 16-bit code set and everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.

UTF-16 thus is a single unit (16-bit word) code set for the ‘Basic Multilingual Plane’, meaning the characters with Unicode code points U+0000 .. U+FFFF, but uses two units (32 bits) for characters outside this range. Thus, code that works with the UTF-16 encoding must be able to handle variable-width encodings, just as code that works with UTF-8 must. The code units that make up the double-unit characters are called surrogates.

Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
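
The arithmetic behind surrogate pairs can be sketched as follows. The function names utf16_encode and utf16_combine are invented for this illustration; utf16_combine assumes it is handed a valid high/low pair and does no checking.

```c
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-16 into out
   (room for 2 units). Returns the number of 16-bit units written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                   /* surrogates are not characters */
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;      /* BMP: one unit, value unchanged */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                  /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: combine a high and a low surrogate back into
   a code point. Assumes hi is in D800..DBFF and lo is in DC00..DFFF. */
uint32_t utf16_combine(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
                      (uint32_t)(lo - 0xDC00));
}
```

With this sketch, U+1F600 becomes the pair 0xD83D 0xDE00, and combining that pair gives back 0x1F600.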

UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.

You can find a lot more information at the ICU and Unicode web sites.

C11 and <uchar.h>

The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as:

  • Unicode characters and strings (<uchar.h>) (originally specified in
    ISO/IEC TR 19769:2004)

What follows is a bare minimal outline of the functionality. The specification includes:

6.4.3 Universal character names

Syntax
universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad
hex-quad:
    hexadecimal-digit hexadecimal-digit
    hexadecimal-digit hexadecimal-digit
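
To show how universal character names look in practice, here is a small sketch that uses them inside C11 string literals with the u8, u and U prefixes. The variable names are arbitrary. The byte and unit values in the comments assume the compiler encodes char16_t and char32_t literals as UTF-16 and UTF-32 (that is, it defines __STDC_UTF_16__ and __STDC_UTF_32__), and that <uchar.h> is available.

```c
#include <uchar.h>

/* U+20AC (euro sign) written with the 4-digit \u form. */
const char     euro_utf8[]  = u8"\u20AC";     /* bytes E2 82 AC, then NUL */
const char16_t euro_utf16[] = u"\u20AC";      /* one unit 0x20AC, then 0 */

/* U+1F600 lies outside the BMP, so it needs the 8-digit \U form. */
const char16_t face_utf16[] = u"\U0001F600";  /* pair D83D DE00, then 0 */
const char32_t face_utf32[] = U"\U0001F600";  /* one unit 0x1F600, then 0 */
```

The same code point therefore occupies three bytes, one 16-bit unit, two 16-bit units or one 32-bit unit, depending on the code point and the literal prefix.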

7.28 Unicode utilities <uchar.h>

The header <uchar.h> declares types and functions for manipulating Unicode characters.

The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);

char16_t

which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and

char32_t

which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).

(Translating the cross-references: <stddef.h> defines size_t, <wchar.h> defines mbstate_t, and <stdint.h> defines uint_least16_t and uint_least32_t.)

The <uchar.h> header also defines a minimal set of (restartable) conversion functions:

  • mbrtoc16()
  • c16rtomb()
  • mbrtoc32()
  • c32rtomb()
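
As a rough sketch of how the restartable functions are meant to be called, the loop below converts a multibyte string to char32_t values with mbrtoc32(). The wrapper name mb_to_c32 is made up for this example. The input is interpreted in the current locale's encoding, so the caller is assumed to have selected a suitable locale (for instance a UTF-8 one, via setlocale()) beforehand.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical wrapper: decode the multibyte string s (in the current
   locale's encoding) into at most max values in out.
   Returns the number of values stored, or (size_t)-1 if s contains an
   invalid or truncated multibyte sequence. */
size_t mb_to_c32(const char *s, char32_t *out, size_t max)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* initial conversion state */
    while (remaining > 0 && count < max) {
        char32_t c;
        size_t rc = mbrtoc32(&c, s, remaining, &state);

        if (rc == (size_t)-1 || rc == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (rc == 0)
            break;                      /* converted a null character */
        out[count++] = c;
        if (rc != (size_t)-3) {         /* -3: no input bytes were consumed */
            s += rc;
            remaining -= rc;
        }
    }
    return count;
}
```

The mbstate_t object is what makes the conversion restartable: it carries any partial state from one call to the next, so the input could equally be fed in chunks.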

There are rules about which Unicode characters can be used in identifiers using the \unnnn or \U00nnnnnn notations. You may have to explicitly enable support for such characters in identifiers. For example, GCC requires -fextended-identifiers to allow these in identifiers.

Note that macOS Sierra (10.12.5), to name but one platform, does not support <uchar.h>.
