Java 8 UTF-8 encoding issue (java bug?)

Modified UTF-8 encodes each half of a surrogate pair (or even an unpaired surrogate) as its own three-byte sequence, as if it were an individual character, whereas standard UTF-8 encodes the supplementary code point as a single four-byte sequence. A decoder that claims to implement standard UTF-8 but accepts Modified UTF-8 input is therefore in error. This seems to have been fixed with Java 8.
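
To make the difference concrete, here is a minimal sketch that encodes one supplementary character, U+1F600, both ways. The class and method names are illustrative and not part of the original answer. It relies on DataOutputStream.writeUTF, which is specified to write Modified UTF-8 preceded by a two-byte length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Standard UTF-8: a supplementary code point becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by writeUTF, with the 2-byte length prefix stripped.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bos.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600));
        System.out.println(hex(standardUtf8(s))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(s))); // ED A0 BD ED B8 80
    }
}
```

The second line of output shows the two surrogates D83D and DE00 each encoded as a separate three-byte sequence, which is the form a strict standard UTF-8 decoder should not accept.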

You can reliably read such data using a method that is specified to use “Modified UTF-8”, such as DataInputStream.readUTF, which expects the data to be preceded by a two-byte length:

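import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.nio.ByteBuffer;

// array holds the Modified UTF-8 bytes; readUTF() needs a 2-byte length prefix, so at most 65535 bytes
ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
bb.putShort((short) array.length).put(array);
ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
DataInputStream dis = new DataInputStream(bis);
String str = dis.readUTF(); // declared to throw IOException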
