How to get code point number for a given character in a utf-8 string?

Use an existing utility such as iconv, or whatever libraries come with the language you’re using.

If you insist on rolling your own solution, read up on the UTF-8 format. Basically, each code point is stored as 1-4 bytes, depending on the value of the code point. The ranges are as follows:

  • U+0000 — U+007F: 1 byte: 0xxxxxxx
  • U+0080 — U+07FF: 2 bytes: 110xxxxx 10xxxxxx
  • U+0800 — U+FFFF: 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • U+10000 — U+10FFFF: 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Where each x is a data bit. Thus, you can tell how many bytes compose each code point by looking at the first byte: if it begins with a 0, it’s a 1-byte character. If it begins with 110, it’s a 2-byte character. If it begins with 1110, it’s a 3-byte character. If it begins with 11110, it’s a 4-byte character. If it begins with 10, it’s a non-initial byte of a multibyte character. If it begins with 11111, it’s an invalid character.
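That leading-byte test can be sketched as a small helper (the function name is my own, not from any library):

```c
#include <stddef.h>

/* Returns the number of bytes in the UTF-8 sequence whose leading
 * byte is `lead`, or 0 if `lead` is not a valid leading byte
 * (a continuation byte 10xxxxxx, or the invalid pattern 11111xxx). */
size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or 11111xxx */
}
```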

Once you figure out how many bytes are in the character, it’s just a matter of bit twiddling: mask off the marker bits and shift the data bits together. Also note that UCS-2 cannot represent code points above U+FFFF.

Since you didn’t specify a language, here’s some sample C code (error checking omitted):

#include <wchar.h>

#define ERROR ((wchar_t)-1)

wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
  if(!(utf8[0] & 0x80))      // 0xxxxxxx
    return (wchar_t)utf8[0];
  else if((utf8[0] & 0xE0) == 0xC0)  // 110xxxxx
    return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
  else if((utf8[0] & 0xF0) == 0xE0)  // 1110xxxx
    return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
  else
    return ERROR;  // uh-oh, UCS-2 can't handle code points this high
}
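If you need the full code point range rather than UCS-2, the same approach extends to 4-byte sequences by returning a 32-bit value. Here is a sketch along the same lines (the function name is mine; validation of continuation bytes and overlong encodings is still omitted):

```c
#include <stdint.h>

/* Decodes one UTF-8 sequence into a code point, handling all four
 * lengths; returns 0xFFFFFFFF on an invalid leading byte. */
uint32_t utf8_char_to_codepoint(const unsigned char *utf8)
{
    if (!(utf8[0] & 0x80))                         /* 0xxxxxxx */
        return utf8[0];
    else if ((utf8[0] & 0xE0) == 0xC0)             /* 110xxxxx */
        return ((uint32_t)(utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F);
    else if ((utf8[0] & 0xF0) == 0xE0)             /* 1110xxxx */
        return ((uint32_t)(utf8[0] & 0x0F) << 12) |
               ((uint32_t)(utf8[1] & 0x3F) << 6)  |
               (utf8[2] & 0x3F);
    else if ((utf8[0] & 0xF8) == 0xF0)             /* 11110xxx */
        return ((uint32_t)(utf8[0] & 0x07) << 18) |
               ((uint32_t)(utf8[1] & 0x3F) << 12) |
               ((uint32_t)(utf8[2] & 0x3F) << 6)  |
               (utf8[3] & 0x3F);
    else
        return 0xFFFFFFFFu;                        /* invalid leading byte */
}
```

For example, the 2-byte sequence C3 A9 decodes to U+00E9 (é), and the 4-byte sequence F0 9F 98 80 decodes to U+1F600.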
