First, the easy cases:
ASCII
If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)
UTF-8
If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.
ISO-8859-1 vs. windows-1252
The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or ISO-8859-1, just detect windows-1252 instead.
That now leaves you with only one question.
How do you distinguish MacRoman from cp1252?
This is a lot trickier.
Undefined characters
The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.
Identical characters
The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.
Statistical approach
Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.
For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—
. Based on this fact,
- The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
- The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.
Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.