Ensuring valid UTF-8 in PHP

UTF-8 can store any Unicode character. Whatever your source encoding is, including ISO-8859-1 or Windows-1252, UTF-8 can represent every character in it, so you don’t have to worry about losing characters when you convert a string from another encoding to UTF-8. Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where …
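The post concerns PHP, but the idea can be sketched in Python as well. The helper below is a minimal illustration, not the article's method: the name `ensure_utf8` and the choice of ISO-8859-1 as the default fallback are assumptions made for this example. It keeps input that already decodes as strict UTF-8 and transcodes everything else from the fallback encoding.

```python
def ensure_utf8(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return data unchanged if it is valid UTF-8, else transcode it from `fallback`."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
        return data
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 were written in `fallback`.
        return data.decode(fallback).encode("utf-8")


# Already valid UTF-8: returned as-is.
print(ensure_utf8("café".encode("utf-8")))
# A lone 0xE9 is invalid UTF-8, so it is read as ISO-8859-1 and re-encoded.
print(ensure_utf8(b"caf\xe9"))
```

Note that the fallback is a guess: a byte string that happens to be valid UTF-8 is never transcoded, even if it was really written in another encoding.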

System.Net.Mail and =?utf-8?B?XXXXX…. Headers

When your subject contains characters outside the ASCII range, the mailing software must encode them (RFC 2822 mail does not permit non-ASCII characters in headers). There are two ways to do this: Quoted-Printable (the subject starts with “=?utf-8?Q”) and Base64 (the subject starts with “=?utf-8?B”). It appears that the framework has figured that the Base64 encoding is …
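To make the two header forms concrete, here is a small Python sketch (not System.Net.Mail code) that builds both variants by hand and decodes them with the standard library's `email.header.decode_header`. The helper names `encode_word_b`, `encode_word_q`, and `decode_subject` are hypothetical, and the sketch ignores the 75-character limit that the encoded-word format places on each word.

```python
import base64
import string
from email.header import decode_header

# Bytes that are kept literally in the Q form; everything else becomes =XX.
_SAFE = (string.ascii_letters + string.digits).encode("ascii")


def encode_word_b(text: str) -> str:
    """Build a Base64 encoded-word ("=?utf-8?B?...?=")."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?utf-8?B?{payload}?="


def encode_word_q(text: str) -> str:
    """Build a Quoted-Printable style encoded-word ("=?utf-8?Q?...?=")."""
    payload = "".join(
        chr(b) if b in _SAFE else f"={b:02X}" for b in text.encode("utf-8")
    )
    return f"=?utf-8?Q?{payload}?="


def decode_subject(header: str) -> str:
    """Decode a header value that may contain encoded-words."""
    parts = []
    for value, charset in decode_header(header):
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii")
        parts.append(value)
    return "".join(parts)


print(encode_word_b("Grüße"))
print(encode_word_q("Grüße"))
print(decode_subject(encode_word_b("Grüße")))
```

Both forms carry the same UTF-8 bytes; a mail client that understands encoded-words shows the same subject for either one.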

UTF-8 Continuation bytes

A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
U+000000-U+00007f    0xxxxxxx  0xxxxxxx
U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx
U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx
U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy …
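The "top two bits are 10" rule amounts to a single mask-and-compare. The following Python sketch shows the test and one use for it, counting code points by skipping continuation bytes. The function names are made up for this example, and the count is only meaningful if the input is already valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are 10."""
    return (byte & 0b1100_0000) == 0b1000_0000


def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting the non-continuation bytes."""
    return sum(1 for b in data if not is_continuation(b))


encoded = "é".encode("utf-8")  # two bytes: 0xC3 0xA9
print([is_continuation(b) for b in encoded])
print(count_code_points("aé€".encode("utf-8")))
```

Because every code point has exactly one lead byte, the same test also lets a decoder resynchronise: from any position, skip forward over continuation bytes to reach the start of the next character.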

Python 3 CSV file giving UnicodeDecodeError: ‘utf-8’ codec can’t decode byte error when I print

We know the file contains the byte b'\x96' since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

import pkgutil
import encodings
import os
…
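One way such a search could be written is shown below. This is a minimal sketch and not the original script: the function name `codecs_decoding_to` is hypothetical, and it assumes that every module in the standard `encodings` package can be tried as a codec name, with anything that fails (helper modules, bytes-to-bytes codecs, platform-specific codecs) simply skipped.

```python
import encodings
import pkgutil


def codecs_decoding_to(data: bytes, expected: str) -> list:
    """Return the names of stdlib codecs that decode `data` to `expected`."""
    matches = []
    for module in pkgutil.iter_modules(encodings.__path__):
        try:
            if data.decode(module.name) == expected:
                matches.append(module.name)
        except Exception:
            # Not a text codec, not available on this platform,
            # or the byte is invalid in this encoding.
            continue
    return sorted(matches)


for name in codecs_decoding_to(b"\x96", "ñ"):
    print(name)
```

Any codec the loop prints is a candidate for the `encoding` argument when opening the CSV file; which one is right still depends on where the file came from.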