Can UTF-8 contain zero byte?

Yes, the zero byte in UTF8 is code point 0, NUL. There is no other Unicode code point that will be encoded in UTF8 with a zero byte anywhere within it.

The possible code points and their UTF8 encoding are:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can see that all the non-zero ASCII characters are represented as themselves while all mutibyte sequences have a high bit of 1 in all their bytes.

You may need to be careful that your ascii plaintext protocol doesn’t treat non-ASCII characters badly (since that will be all non-ASCII code points).

Leave a Comment