How to find out the Encoding of a File? C#

There is no reliable way to do this, since the file might be arbitrary binary data. However, the heuristic that Windows Notepad uses is described on Michael S. Kaplan’s blog:

http://www.siao2.com/2007/04/22/2239345.aspx

  1. Check the first two bytes;
    1. If there is a UTF-16 LE BOM, then treat it (and load it) as a “Unicode” file;
    2. If there is a UTF-16 BE BOM, then treat it (and load it) as a “Unicode (Big Endian)” file;
    3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a “UTF-8” file;
  2. Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE; if so, then treat it (and load it) as a “Unicode” file;
  3. Check to see if it is UTF-8 using the original RFC 2279 definition from 1998, and if it is, then treat it (and load it) as a “UTF-8” file;
  4. Assume an ANSI file using the default system code page of the machine.

Now note that there are some holes here, like the fact that step 2 does not do quite as well with BOM-less UTF-16 BE (there may even be a bug here, I’m not sure; if so, it’s a bug in Notepad beyond any bug in IsTextUnicode).
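
The steps above can be roughly approximated in C#. The following is only a minimal sketch under some assumptions, not Notepad's actual code: `EncodingSniffer`, `Guess`, and `GuessFile` are hypothetical names, step 2 (`IsTextUnicode`, a Win32 API) is skipped entirely, and the UTF-8 check relies on .NET's strict decoder, which follows the current UTF-8 definition rather than the older RFC 2279 one. The "ANSI" fallback is passed in by the caller instead of being looked up.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSniffer
{
    // Hypothetical helper: guesses an encoding from raw file bytes.
    // 'fallback' stands in for the "ANSI" code page of step 4.
    public static Encoding Guess(byte[] bytes, Encoding fallback)
    {
        // Step 1: look for a byte order mark.
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 LE
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 BE
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;               // UTF-8 with BOM

        // Step 2 (IsTextUnicode) is deliberately not reproduced here.

        // Step 3: accept the data as UTF-8 only if it decodes without errors.
        // Note that pure ASCII (and an empty file) also passes this check.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return strictUtf8;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; fall through to step 4.
        }

        // Step 4: assume the caller-supplied legacy encoding.
        return fallback;
    }

    // Convenience wrapper that reads the whole file into memory first.
    public static Encoding GuessFile(string path, Encoding fallback)
    {
        return Guess(File.ReadAllBytes(path), fallback);
    }
}
```

A call might look like `EncodingSniffer.GuessFile(path, Encoding.Default)`. On .NET Framework, `Encoding.Default` is the system ANSI code page; on .NET Core and later it is UTF-8, so a specific legacy code page would have to be requested explicitly there. Like the Notepad procedure itself, this remains a guess and can be wrong for BOM-less files.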
