utf-8 - w3toppers.com

UnicodeDecodeError when performing os.walk

Right, I just spent some time sorting through this error, and the wordier answers here aren’t getting at the underlying issue. The problem is that if you pass a unicode string into os.walk(), then os.walk() starts getting unicode back from os.listdir() and tries to keep it as ASCII (hence the ‘ascii’ decode error). When it hits a unicode … Read more
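
The excerpt reads like a Python 2 answer, where str and unicode were separate types. As a rough illustration in Python 3 (an assumption about the reader's environment, not the original poster's code), the same rule survives in a cleaner form: os.walk() hands back names in the same type as the path it was given. The helper names walk_names and demo are hypothetical.

```python
import os
import tempfile


def walk_names(top):
    """Collect every directory and file name that os.walk() yields under top."""
    names = []
    for _root, dirs, files in os.walk(top):
        names.extend(dirs)
        names.extend(files)
    return names


def demo():
    """Walk one temporary directory twice: once as str, once as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "note.txt"), "w").close()
        as_text = walk_names(tmp)                # str in, str out
        as_bytes = walk_names(os.fsencode(tmp))  # bytes in, bytes out
    return as_text, as_bytes
```

If bytes names need to become text later, os.fsdecode() converts them using the filesystem encoding.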

Ensuring valid UTF-8 in PHP

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don’t have to worry about losing any characters when you convert a string from any other encoding to UTF-8. Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where … Read more
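
The excerpt is about PHP, but the idea can be sketched in Python: check whether the bytes are already valid UTF-8, and if not, re-encode them from a single-byte encoding. The function name to_utf8 and the choice of latin-1 as the default fallback are assumptions for illustration; the fallback is a guess about the input, not a detection.

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw as valid UTF-8 bytes.

    Bytes that already decode as UTF-8 are returned unchanged. Anything else
    is assumed to be in the fallback single-byte encoding and is converted.
    """
    try:
        raw.decode("utf-8")  # strict decoding raises on any invalid sequence
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")
```

For example, to_utf8(b"caf\xe9") turns the ISO-8859-1 byte 0xE9 into the two-byte UTF-8 sequence 0xC3 0xA9, while input that is already UTF-8 passes through untouched.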

System.Net.Mail and =?utf-8?B?XXXXX…. Headers

When your subject contains characters outside the ASCII range, the mailing software must encode them (RFC 2822 mail does not permit non-ASCII characters in headers). There are two ways to do this:

- Quoted Printable (subject starts with “=?utf-8?Q”)
- Base64 (subject starts with “=?utf-8?B”)

It appears that the framework has figured that the Base64 encoding is … Read more
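
To make the Base64 form concrete, here is a small Python sketch (not System.Net.Mail code) that builds one "B" encoded-word by hand and decodes it again with the standard library. The helper names encode_subject_b and decode_subject are hypothetical, and the sketch ignores the splitting of long subjects that real mailers perform.

```python
import base64
from email.header import decode_header


def encode_subject_b(subject: str) -> str:
    """Build a single Base64 ("B") encoded-word for a subject line.

    Simplification: long subjects are not split into several encoded-words.
    """
    payload = base64.b64encode(subject.encode("utf-8")).decode("ascii")
    return "=?utf-8?B?" + payload + "?="


def decode_subject(header_value: str) -> str:
    """Decode any encoded-words in a header value back to text."""
    parts = []
    for data, charset in decode_header(header_value):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)  # plain, unencoded text comes back as str
    return "".join(parts)
```

A mail client performs the decode step when it displays the subject, which is why the raw =?utf-8?B?...?= text is normally only visible in the message source.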

What’s the best way to export UTF8 data into Excel?

I found that if you set the charset encoding of the web page to utf-8, and then Response.BinaryWrite the UTF-8 Byte Order Mark (0xEF 0xBB 0xBF) at the top of the csv file, then Excel 2007 (not sure about other versions) will recognize it as utf-8 and open it correctly.
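
The same trick can be sketched outside ASP.NET. In Python, the utf-8-sig codec prepends the three BOM bytes (0xEF 0xBB 0xBF) when encoding. The function name csv_bytes_for_excel is hypothetical, and whether a given Excel version honors the BOM is the excerpt's observation, not something this sketch verifies.

```python
import csv
import io


def csv_bytes_for_excel(rows):
    """Serialize rows to CSV and encode as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    # "utf-8-sig" writes the byte order mark before the encoded text.
    return buffer.getvalue().encode("utf-8-sig")
```

The returned bytes can be written to a file opened in binary mode or sent as an HTTP response body.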

How to remove invalid UTF-8 characters from a JavaScript string?

I use this simple and sturdy approach:

    function cleanString(input) {
        var output = "";
        for (var i = 0; i < input.length; i++) {
            if (input.charCodeAt(i) <= 127) {
                output += input.charAt(i);
            }
        }
        return output;
    }

Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it’s a good … Read more
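
For comparison, a Python rendering of the same filter follows; clean_string is a hypothetical name. Like the JavaScript version, it keeps only code points 0-127, so it also discards perfectly valid non-ASCII characters such as accented letters.

```python
def clean_string(text: str) -> str:
    """Keep only ASCII characters (code points 0-127), dropping everything else."""
    return "".join(ch for ch in text if ord(ch) <= 127)
```

That trade-off is worth keeping in mind: the approach is simple, but it is lossy for any text that is not plain English.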

UTF-8 Continuation bytes

A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
U+000000-U+00007f    0xxxxxxx  0xxxxxxx

U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy … Read more
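
The "top two bits are 10" rule translates directly into a bit mask. The sketch below uses hypothetical helper names and assumes the input is already valid UTF-8; it tests for continuation bytes and uses that test to count code points, since every code point contributes exactly one non-continuation byte.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    return sum(1 for byte in data if not is_continuation(byte))
```

For instance, "é" encodes to the bytes 0xC3 0xA9: the first is a lead byte (110xxxxx) and the second is a continuation byte.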

Rails 3 invalid multibyte char (US-ASCII)

Instead of adding # coding: UTF-8, try adding # encoding: UTF-8 on the first line of the file. It worked for me. I found the information here: http://groups.google.com/group/sinatrarb/browse_thread/thread/f92529bf0cf62015

Why is the return value of String.addingPercentEncoding() optional?

I filed a bug report with Apple about this, and heard back — with a very helpful response, no less! Turns out (much to my surprise) that it’s possible to successfully create Swift strings that contain invalid Unicode in the form of unpaired UTF-16 surrogate chars. Such a string can cause UTF-8 encoding to fail. … Read more
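
Python strings can hold unpaired surrogates too, so the failure mode is easy to reproduce there. This is an analogue for illustration, not Swift's implementation; try_utf8 is a hypothetical helper that mirrors an optional return by yielding None when encoding fails.

```python
from typing import Optional


def try_utf8(text: str) -> Optional[bytes]:
    """Encode text as UTF-8, or return None if it contains a lone surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        # Unpaired surrogates (U+D800-U+DFFF) have no valid UTF-8 encoding.
        return None
```

A properly paired character such as an emoji encodes without trouble; only half of a surrogate pair on its own makes the call return None.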

MySQL throws Incorrect string value error

It’s the character at the end of the tweet that’s causing the problem. It looks like an ‘emoji’ character (a.k.a. a Japanese smiley face), but it’s not displaying for me in either Chrome or Safari. There are known issues storing 4-byte UTF-8 characters in some versions of MySQL. Apparently you must use utf8mb4 to represent 4 … Read more
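
One way to see which characters are affected is to look for code points above U+FFFF, since those are the ones that take four bytes in UTF-8. The helper name four_byte_chars is hypothetical, and this is only a client-side check, not a MySQL API.

```python
def four_byte_chars(text: str) -> list:
    """Return the characters in text that need four bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]
```

Running it over a tweet before inserting it shows whether the text contains anything that a 3-byte character set would reject.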

Python 3 CSV file giving UnicodeDecodeError: ‘utf-8’ codec can’t decode byte error when I print

We know the file contains the byte b'\x96' since it is mentioned in the error message:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

    import pkgutil
    import encodings
    import os

… Read more
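
The original script is cut off above, so the following is an independent sketch of the same idea rather than a reconstruction of it: try every codec module shipped in Python's encodings package and keep the ones that decode the byte to the expected character. The function name encodings_decoding_to is hypothetical.

```python
import encodings
import pkgutil


def encodings_decoding_to(raw: bytes, target: str) -> list:
    """List the stdlib codec names under which raw decodes to target."""
    matches = []
    for module in pkgutil.iter_modules(encodings.__path__):
        try:
            if raw.decode(module.name) == target:
                matches.append(module.name)
        except Exception:
            # Not a text codec, not available on this platform,
            # or the bytes are invalid in that encoding.
            continue
    return sorted(matches)
```

Calling encodings_decoding_to(b"\x96", "ñ") narrows the candidates to a handful of encodings, which can then be tried when opening the CSV file.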