What is the most accurate encoding detector? [closed]

I’ve tested juniversalchardet and ICU4J on some CSV files, and the results were inconsistent. Overall, juniversalchardet had the better results:

  • UTF-8: Both detected it.
  • Windows-1255: juniversalchardet detected it once there were enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (so the text still came out fine).
  • SHIFT_JIS (Japanese): juniversalchardet detected it, while ICU4J thought it was ISO-8859-2.
  • ISO-8859-1: detected by ICU4J; not supported by juniversalchardet.

So you should consider which encodings you are most likely to deal with.
In the end I chose ICU4J.
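
For reference, here is a minimal sketch of how ICU4J’s CharsetDetector is typically called; the file name is a placeholder:

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Icu4jDetect {
    public static void main(String[] args) throws IOException {
        // Read the raw bytes of the file ("input.csv" is a placeholder).
        byte[] bytes = Files.readAllBytes(Paths.get("input.csv"));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);

        // detect() returns the best match, or null if no charset looked plausible.
        CharsetMatch match = detector.detect();
        if (match != null) {
            System.out.println(match.getName()
                    + " (confidence: " + match.getConfidence() + "/100)");
        }
    }
}
```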

Note that ICU4J is still actively maintained.

You may also want to try ICU4J first and, if it returns null because detection failed, fall back to juniversalchardet (or the other way around); see the sketch below.
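
A minimal sketch of that fallback chain, assuming the file is small enough to hold in memory (detectCharset is a hypothetical helper name, not part of either library):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import org.mozilla.universalchardet.UniversalDetector;

public final class CharsetGuesser {
    // Hypothetical helper: ICU4J first, juniversalchardet as the fallback.
    public static String detectCharset(byte[] bytes) {
        CharsetDetector icu = new CharsetDetector();
        icu.setText(bytes);
        CharsetMatch match = icu.detect();
        if (match != null) {
            return match.getName();
        }

        // ICU4J found nothing plausible; let juniversalchardet try.
        UniversalDetector universal = new UniversalDetector(null);
        universal.handleData(bytes, 0, bytes.length);
        universal.dataEnd();
        return universal.getDetectedCharset(); // may still be null
    }
}
```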

Apache Tika’s AutoDetectReader does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
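
A minimal usage sketch, assuming the Tika artifact that provides AutoDetectReader is on the classpath (the file name is a placeholder):

```java
import java.io.FileInputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.AutoDetectReader;

public class TikaDetect {
    public static void main(String[] args) throws Exception {
        // AutoDetectReader runs the detector chain and then decodes the stream.
        try (AutoDetectReader reader = new AutoDetectReader(
                new FileInputStream("input.csv"))) {
            Charset charset = reader.getCharset(); // charset the chain settled on
            System.out.println("Detected: " + charset);
            // reader itself is a java.io.Reader already using that charset.
        }
    }
}
```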
