What is character encoding and why should I bother with it?

(Note that I’m using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)

A byte is 8 bits, so it can hold only 256 distinct values.

Since there are character sets with more than 256 characters, one cannot in general simply say that each character is a byte.

Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.

Those mappings are encodings, because they tell you how to encode characters into sequences of bytes.
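To make that concrete, here is a minimal sketch that encodes the same characters with a few different encodings and prints the resulting bytes in hex. The class name `EncodingDemo` and the helper `hexBytes` are made up for this illustration; the charsets come from the standard `java.nio.charset.StandardCharsets` class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encode the text with the given charset and
    // render the resulting bytes as space-separated hex, e.g. "C3 A9".
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00E9 (e with acute accent), U+20AC (euro sign)
        String[] samples = {"A", "\u00E9", "\u20AC"};
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8=[%s]  UTF-16BE=[%s]%n",
                    s.codePointAt(0),
                    hexBytes(s, StandardCharsets.UTF_8),
                    hexBytes(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

In UTF-8 the three sample characters take one, two, and three bytes respectively (41, C3 A9, E2 82 AC), while UTF-16BE uses two bytes for each of them. Same characters, different mappings, different bytes.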

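As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number (called a code point) to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters 🙂

Java represents text internally as UTF-16, an encoding of Unicode built from 16-bit units, and this is why a Java char is 16 bits wide and has integer values from 0 to 65535. Most everyday characters fit in a single char, but Unicode defines code points above 65535 (many emoji, for example), and those are stored as a pair of chars called a surrogate pair. When you get the byte representation of a Java character or string, you should tell the JVM the encoding you want to use so it will know how to choose the byte sequence. If you don't, it falls back to a default charset, which can differ between systems and Java versions.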
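Here is a small sketch of what that looks like in practice, using the smiley above (code point U+1F642) as the sample. The class name `CodeUnitDemo` and its helper methods are invented for this example; `String.length`, `String.codePointCount`, `Character.toChars`, and `String.getBytes(Charset)` are standard Java methods.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F642 lies above 65535, so Java stores it as two chars.
        String smiley = new String(Character.toChars(0x1F642));

        System.out.println("chars:       " + codeUnits(smiley));   // 2
        System.out.println("code points: " + codePoints(smiley));  // 1

        // Always name the encoding when converting to bytes.
        byte[] utf8 = smiley.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = smiley.getBytes(StandardCharsets.UTF_16BE);
        System.out.println("UTF-8 bytes:    " + utf8.length);      // 4
        System.out.println("UTF-16BE bytes: " + utf16.length);     // 4
    }
}
```

The point of passing `StandardCharsets.UTF_8` explicitly is that the result no longer depends on whatever default charset the JVM happens to pick on a given machine.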
