Byte order mark screws up file reading in Java

EDIT: I’ve made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to solutions posted in Sun’s bug database. Incorporate it in your code and you’re fine.

/* ____________________________________________________________________________
 *
 * File: UnicodeBOMInputStream.java
 * Author: …
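The general idea behind skipping a byte order mark can be sketched independently of Java. The snippet below is a rough Python illustration, not the UnicodeBOMInputStream class itself: it compares the leading bytes against the known BOM signatures and drops them if present. The function name `strip_bom` and the fallback encoding are assumptions made for the example.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def strip_bom(data: bytes, default: str = "utf-8"):
    """Return (payload, encoding), with any leading BOM removed.

    `strip_bom` and `default` are hypothetical names for this sketch.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    # No BOM found: hand the bytes back untouched with the fallback encoding.
    return data, default


payload, encoding = strip_bom(codecs.BOM_UTF8 + "héllo".encode("utf-8"))
text = payload.decode(encoding)
```

If all you need in Python is UTF-8 with an optional BOM, decoding with the built-in `utf-8-sig` codec would probably be simpler than a hand-rolled check.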

What’s the difference between utf8_general_ci and utf8_unicode_ci?

For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci. All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared. _unicode_ci and _general_ci are two different sets of rules …

How to decode Unicode escape sequences like “\u00ed” to proper UTF-8 encoded characters?

Try this:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);

In case it’s UTF-16 based C/C++/Java/JSON-style:

$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);
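For comparison, here is a rough Python sketch of the same idea, assuming the UTF-16-style escapes of the second variant (so a character outside the BMP arrives as two `\uXXXX` escapes forming a surrogate pair). The function name `decode_unicode_escapes` is made up for the example.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Turn \\uXXXX escapes into characters (hypothetical helper)."""
    # Step 1: replace each escape with the corresponding UTF-16 code unit.
    # High and low surrogates are still separate code points at this stage.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so that surrogate pairs are merged
    # into single characters. A lone surrogate raises UnicodeDecodeError.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


decoded = decode_unicode_escapes(r"caf\u00e9")
utf8_bytes = decoded.encode("utf-8")
```

The result is an ordinary Python string; call `.encode("utf-8")` on it, as in the last line, when UTF-8 bytes are what you actually need.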

Unicode characters in URLs

Use percent encoding. Modern browsers will take care of display and paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문 Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only part of it, it will remain unencoded.
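To make the percent-encoded form concrete, here is a small Python sketch using the standard library's `urllib.parse`. It encodes the path of the Wikipedia URL above as UTF-8 bytes and escapes every non-ASCII byte; keeping `/` and `:` unescaped is a choice made for this example.

```python
from urllib.parse import quote, unquote

# Path portion of the example URL; only this part is percent-encoded here.
path = "/wiki/위키백과:대문"

# quote() encodes the text as UTF-8 and escapes each non-ASCII byte as %XX.
# '/' and ':' are listed as safe so the path structure stays readable.
encoded = quote(path, safe="/:")
url = "http://ko.wikipedia.org" + encoded

# unquote() reverses the process, which is roughly what a browser does
# when it shows the human-readable form in the address bar.
decoded = unquote(encoded)
```

The encoded path contains only ASCII characters, so it survives copy and paste through tools that do not handle Unicode well.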

Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)

Note: This answer shows how to switch the character encoding in the Windows console to UTF-8 (code page 65001), so that shells such as cmd.exe and PowerShell properly encode and decode characters (text) when communicating with external (console) programs with full Unicode support, and in cmd.exe also for file I/O.[1] If, by contrast, your concern …

Changing default encoding of Python?

Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:

import sys
# sys.setdefaultencoding() does not exist here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

(Note for Python 3.4+: reload() is in the importlib library.) This is not a safe thing to do, though: this is obviously …
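Since changing the default is flagged as unsafe, one commonly used alternative is to leave the default alone and name the encoding at each I/O boundary. The sketch below shows that approach, not the hack above; the helper names `write_text` and `read_text` are invented for the example. `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write text as UTF-8, whatever the interpreter's default encoding is."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    """Read a UTF-8 file back into a text (unicode) string."""
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


# Round-trip a non-ASCII string through a temporary file.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "sample.txt")
write_text(tmp_path, u"na\u00efve caf\u00e9")
round_tripped = read_text(tmp_path)
```

Because the encoding is explicit, the result should not depend on locale settings or on what the default encoding happens to be.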