UTF-16 to UTF-8 conversion (for scripting in Windows)
There is a GNU tool recode which you can also use on Windows. E.g. recode utf16..utf8 text.txt
It sounds like the book is saying that ‘ℤ’ is not a UTF-16 character in the basic multilingual plane, but in fact it is. Java uses UTF-16 with surrogate pairs for characters that are not in the basic multilingual plane. Since ‘ℤ’ (0x2124) is in the basic multilingual plane it is represented by a single …
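To make the surrogate-pair rule concrete, here is a minimal Python sketch of how a single code point maps to UTF-16 code units. The helper name utf16_code_units is hypothetical, and the sketch assumes the input is one valid, non-surrogate code point; U+1D56B is used only as an arbitrary example of a character outside the basic multilingual plane.

```python
from typing import List


def utf16_code_units(ch: str) -> List[int]:
    """Return the UTF-16 code units for one code point (hypothetical helper)."""
    cp = ord(ch)
    if cp < 0x10000:
        # Basic multilingual plane: one 16-bit code unit, equal to the code point.
        return [cp]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([hex(u) for u in utf16_code_units("\u2124")])      # U+2124, inside the BMP
print([hex(u) for u in utf16_code_units("\U0001D56B")])  # U+1D56B, outside the BMP
```

The first call yields a single code unit and the second a pair, which matches what Python's own "utf-16-be" codec produces for the same characters.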
JavaScript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in JavaScript because of this, and I do not suggest attempting to do so. As for what Twitter does, you seem to be saying that it is sanely counting by code point, not insanely by code unit. Unless you have …
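The difference between counting by code point and counting by code unit can be sketched in Python, whose str type is indexed by code point. The two function names are hypothetical, and this only illustrates the two counting rules; it makes no claim about how Twitter actually counts.

```python
def count_code_points(text: str) -> int:
    """Number of Unicode code points (Python's len() already counts these)."""
    return len(text)


def count_utf16_code_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed for the same text."""
    # Encode without a BOM so that only the text itself is counted.
    return len(text.encode("utf-16-le")) // 2


sample = "a\U0001F600"  # one BMP letter plus one character outside the BMP
print(count_code_points(sample))       # code points
print(count_utf16_code_units(sample))  # code units
```

The two counts agree for text that stays inside the basic multilingual plane and diverge as soon as a surrogate pair is needed.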
Right, I just spent some time sorting through this error, and the wordier answers here aren’t getting at the underlying issue. The problem is that if you pass a unicode string into os.walk(), then os.walk starts getting unicode back from os.listdir() and tries to keep it as ASCII (hence the ‘ascii’ decode error). When it hits a unicode …
I found that if you set the charset encoding of the web page to UTF-8, and then Response.BinaryWrite the UTF-8 Byte Order Mark (0xEF 0xBB 0xBF) at the top of the CSV file, then Excel 2007 (not sure about other versions) will recognize it as UTF-8 and open it correctly.
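The same BOM trick can be sketched outside ASP.NET. The Python example below is an illustration rather than the original answer's code: it builds CSV bytes that start with the UTF-8 BOM by using the standard "utf-8-sig" codec. The function name csv_bytes_with_bom and the sample rows are made up for the example.

```python
import csv
import io
from typing import Iterable, Sequence


def csv_bytes_with_bom(rows: Iterable[Sequence[str]]) -> bytes:
    """Serialize rows as CSV and prefix the UTF-8 BOM (hypothetical helper)."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    # "utf-8-sig" prepends the bytes EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")


payload = csv_bytes_with_bom([["name", "city"], ["Zoë", "Kraków"]])
print(payload[:3])  # the three BOM bytes
```

The resulting bytes can be written to a file or sent as a response body unchanged.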
Your point about UTF-8 not being able to store all characters you need is invalid. UTF-8 is able to store every character defined in the Unicode standard. The only difference is that, for text in certain languages, UTF-8 can take more space to store its codepoints than, say, UTF-16. The opposite is also true: for …
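The size trade-off is easy to measure. The following Python sketch compares encoded sizes for two sample strings; the helper name encoded_sizes and the samples are chosen for illustration, and UTF-16 is measured without a BOM.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Byte length of the same text in UTF-8 and in UTF-16 without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("hello"))       # ASCII text: UTF-8 is smaller
print(encoded_sizes("こんにちは"))   # Japanese kana: UTF-16 is smaller
```

Both encodings round-trip every string; only the byte counts differ.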
The latest version of golang.org/x/text/encoding/unicode makes it easier to do this because it includes unicode.BOMOverride, which will intelligently interpret the BOM. Here is ReadFileUTF16(), which is like os.ReadFile() but decodes UTF-16.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// Similar to ioutil.ReadFile() but decodes UTF-16. Useful when
// reading data …
Change encoding to UTF-8 with PowerShell: Get-Content PATH\temp.txt -Encoding Unicode | Set-Content -Encoding UTF8 PATH2\temp.txt
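For scripts that cannot rely on PowerShell, a rough Python equivalent might look like the sketch below. The function name utf16_to_utf8 and the file names are placeholders. It works on raw bytes so that line endings are preserved, and it relies on Python's "utf-16" codec, which honors a BOM if one is present and otherwise assumes the platform's native byte order.

```python
from pathlib import Path


def utf16_to_utf8(src: str, dst: str) -> None:
    """Re-encode a UTF-16 file as UTF-8 without a BOM (hypothetical helper)."""
    # Decode from bytes directly so "\r\n" sequences are left untouched.
    text = Path(src).read_bytes().decode("utf-16")
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder paths; adjust to the actual input and output files.
    if Path("temp.txt").exists():
        utf16_to_utf8("temp.txt", "temp-utf8.txt")
```

Unlike Set-Content -Encoding UTF8 in Windows PowerShell, this sketch does not write a BOM to the output file; encode with "utf-8-sig" instead if one is wanted.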
This is the difference between UTF-16LE and UTF-16:
UTF-16LE is little endian without a BOM.
UTF-16 is big or little endian with a BOM.
So when you use UTF-16LE, the BOM is just part of the text. Use UTF-16 instead, so the BOM is automatically removed. The reason UTF-16LE and UTF-16BE exist is so people …
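Python's codecs follow the same convention, which makes the behavior easy to observe. This is a small illustrative sketch, not the original answer's code; the variable names are arbitrary.

```python
import codecs

# Little-endian UTF-16 bytes for "hi", preceded by a little-endian BOM.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# "utf-16-le" treats the BOM as ordinary text, so U+FEFF stays in the result.
decoded_le = data.decode("utf-16-le")

# "utf-16" reads the BOM to pick the byte order and then drops it.
decoded_generic = data.decode("utf-16")

print(repr(decoded_le))
print(repr(decoded_generic))
```

The first result begins with U+FEFF, while the second contains only the text.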
The Unicode standard’s Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):

…
21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ; emoji ; L1 ; none ; j …