utf-8
Best way to convert text files between character sets?
Stand-alone utility approach:
iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING  the encoding of the input
-t ENCODING  the encoding of the output
You don’t have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
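Where calling iconv is not an option, the same conversion can be sketched in Python. This is a minimal illustration, not part of the original answer: the helper name convert_encoding and its default encodings are assumptions chosen to mirror the command above, and the whole file is read into memory, so it suits small files only.

```python
def convert_encoding(src, dst, from_enc="ISO-8859-1", to_enc="UTF-8"):
    """Re-encode the text in src and write it to dst (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are preserved
    with open(src, "r", encoding=from_enc, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=to_enc, newline="") as fout:
        fout.write(text)
```

Calling convert_encoding("in.txt", "out.txt") would correspond to the iconv invocation shown above; a character that the target encoding cannot represent raises UnicodeEncodeError instead of being silently dropped.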
Using PowerShell to write a file in UTF-8 without the BOM
Using .NET’s UTF8Encoding class and passing $False to the constructor seems to work:
$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)
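As a point of comparison outside PowerShell (not part of the original answer), Python exposes the same choice through two codec names: "utf-8" writes no BOM, while "utf-8-sig" prepends one. The function name below is hypothetical and the snippet is only a sketch of that difference.

```python
import codecs

def encode_with_and_without_bom(text):
    """Return (bytes without BOM, bytes with BOM) for the same text."""
    no_bom = text.encode("utf-8")        # plain UTF-8: no BOM is emitted
    with_bom = text.encode("utf-8-sig")  # utf-8-sig prepends the bytes EF BB BF
    return no_bom, with_bom

no_bom, with_bom = encode_with_and_without_bom("héllo")
print(no_bom.startswith(codecs.BOM_UTF8))    # False
print(with_bom.startswith(codecs.BOM_UTF8))  # True
```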
Byte order mark screws up file reading in Java
EDIT: I’ve made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream
Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to solutions posted in Sun’s bug database. Incorporate it in your code and you’re fine.
/* ____________________________________________________________________________ * * File: UnicodeBOMInputStream.java * Author: …
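The idea behind such a wrapper, checking the first bytes for a BOM and skipping them before decoding, can be illustrated in a few lines of Python. This is a simplified sketch, not a port of the Java class: it handles only the UTF-8 BOM, and the function name decode_skipping_bom is an assumption.

```python
import codecs

def decode_skipping_bom(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        # Skip the three BOM bytes (EF BB BF) so they do not appear in the text
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Python's built-in "utf-8-sig" codec behaves the same way when decoding, so data.decode("utf-8-sig") is usually the shorter choice there; the explicit version is shown only to make the check visible.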
Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode. This function is only available at Python start-up time, when Python scans the environment. It has to be called in …
What’s the difference between utf8_general_ci and utf8_unicode_ci?
For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci. All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared. _unicode_ci and _general_ci are two different sets of rules …
How to decode Unicode escape sequences like “\u00ed” to proper UTF-8 encoded characters?
Try this:
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, $str);
In case it’s UTF-16 based C/C++/Java/Json-style:
$str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UTF-16BE');
}, $str);
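For readers working in Python rather than PHP, the same approach (match each \uXXXX escape, turn the hex digits into a code unit, then interpret the result as UTF-16) might look like the sketch below. The function name is hypothetical. The UTF-16 round trip is there so that surrogate pairs written as two escapes are merged into a single character; an unpaired surrogate escape would raise UnicodeDecodeError.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_unicode_escapes(s):
    """Replace literal \\uXXXX sequences with the characters they name."""
    # Turn each escape into the corresponding UTF-16 code unit
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)
    # Round-trip through UTF-16 so that surrogate pairs become one character
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
```

For example, decode_unicode_escapes(r"\u00ed") returns "í", and text without escapes is returned unchanged.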
Unicode characters in URLs
Use percent encoding. Modern browsers will take care of display and paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문 Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
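To see what the percent-encoded form looks like, here is a small Python sketch using the standard library's urllib.parse. The helper name percent_encode_path and the choice to leave "/" and ":" unescaped are assumptions made for this illustration.

```python
from urllib.parse import quote, unquote

def percent_encode_path(path):
    """Percent-encode a URL path as UTF-8, leaving '/' and ':' unescaped."""
    return quote(path, safe="/:")

encoded = percent_encode_path("/wiki/위키백과:대문")
print(encoded)           # ASCII-only form, as held in the clipboard
print(unquote(encoded))  # decodes back to the human-readable path
```

Each non-ASCII character is first encoded as UTF-8 and every resulting byte is written as %XX, which is why one Hangul syllable becomes three escapes.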
Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)
Note: This answer shows how to switch the character encoding in the Windows console to UTF-8 (code page 65001), so that shells such as cmd.exe and PowerShell properly encode and decode characters (text) when communicating with external (console) programs with full Unicode support, and in cmd.exe also for file I/O. If, by contrast, your concern …
Changing default encoding of Python?
Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')
(Note for Python 3.4+: reload() is in the importlib library.) This is not a safe thing to do, though: this is obviously …