unicode - w3toppers.com

Manually converting unicode codepoints into UTF-8 and UTF-16

Wow. On the one hand I’m thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?) The clearest description I’ve seen so far for the rules to encode … Read more
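
To make the idea of encoding by hand concrete, here is a minimal sketch of the standard UTF-8 and UTF-16 bit layouts. The function names (encode_utf8, encode_utf16_be) are hypothetical and are not taken from the original answer; the results are checked against Python's built-in encoders.

```python
def _check_scalar(cp: int) -> None:
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF cannot be encoded.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp!r}")


def encode_utf8(cp: int) -> bytes:
    """Encode a single code point to UTF-8 by hand (hypothetical helper)."""
    _check_scalar(cp)
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([       # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_utf16_be(cp: int) -> bytes:
    """Encode a single code point to big-endian UTF-16 by hand (hypothetical helper)."""
    _check_scalar(cp)
    if cp < 0x10000:     # BMP: one 16-bit code unit
        return cp.to_bytes(2, "big")
    v = cp - 0x10000     # 20 bits, split across a surrogate pair
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


# Compare the hand-rolled results with the built-in codecs.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    assert encode_utf16_be(cp) == chr(cp).encode("utf-16-be")
```

The thresholds 0x80, 0x800 and 0x10000 mark where each additional UTF-8 byte becomes necessary; in UTF-16, only code points from 0x10000 upward need a surrogate pair.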

What’s the difference between Unicode and UTF-8? [duplicate]

As Rasmus states in his article “The difference between UTF-8 and Unicode?”: If asked the question, “What is the difference between UTF-8 and Unicode?”, would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand … Read more

What’s the difference between ASCII and Unicode?

ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 2^21 characters, which, similarly, map to numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved). Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For … Read more
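
The superset relationship can be checked in a few lines. This is only an illustrative sketch; the sample string and variable names are assumptions, not part of the original answer.

```python
# Every ASCII character keeps the same number in Unicode.
ascii_text = "Hello, world"
code_points = [ord(ch) for ch in ascii_text]
all_below_128 = all(cp < 128 for cp in code_points)

# Pure-ASCII text produces identical bytes as ASCII and as UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# The highest code point, U+10FFFF, sits below 2**21.
MAX_CODE_POINT = 0x10FFFF
fits_in_21_bits = MAX_CODE_POINT < 2**21

print(all_below_128, same_bytes, fits_in_21_bits)
```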

Is there a Windows command shell that will display Unicode characters?

To do this with cmd.exe, you’ll need to use the console properties dialog to switch to a Unicode TrueType font. Then use these commands: CHCP 65001 (switch the console to UTF-8, code page 65001); DIR > UTF8.TXT (redirect the output of DIR to UTF8.TXT); TYPE UTF8.TXT (dump the UTF-8 file to the console). The characters will still need to be supported … Read more

Unicode, UTF, ASCII, ANSI format differences

Going down your list: “Unicode” isn’t an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to … Read more

FPDF utf-8 encoding (HOW-TO)

Don’t use UTF-8 encoding. Standard FPDF fonts use ISO-8859-1 or Windows-1252. It is possible to perform a conversion to ISO-8859-1 with utf8_decode(): $str = utf8_decode($str); But some characters, such as the Euro sign, won’t be translated correctly, because ISO-8859-1 has no code for them. If the iconv extension is available, the right way to do it is the following: $str = iconv('UTF-8', 'windows-1252', $str);
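
The difference between the two target encodings can be seen outside PHP as well. The following Python sketch (the sample string is an assumption) shows that the Euro sign has a byte in Windows-1252 but none in ISO-8859-1, which is why a conversion to ISO-8859-1 loses it.

```python
text = "Price: 10 €"

# Windows-1252 assigns the Euro sign to byte 0x80.
cp1252_bytes = text.encode("cp1252")

# ISO-8859-1 has no Euro sign, so a strict conversion fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# A lossy conversion substitutes "?" for the unmappable character.
latin1_lossy = text.encode("iso-8859-1", errors="replace")

print(cp1252_bytes, latin1_ok, latin1_lossy)
```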

UTF-8, UTF-16, and UTF-32

UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes these into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file. UTF-16 is better where ASCII … Read more
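
A quick way to see the size trade-off is to encode the same text in all three forms and count bytes. This is a minimal sketch; the helper name encoded_sizes and the sample strings are assumptions. The little-endian codec variants are used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_sizes = encoded_sizes("hello")   # ASCII-only text
cjk_sizes = encoded_sizes("日本語")     # three BMP characters outside ASCII

print(ascii_sizes)
print(cjk_sizes)
```

For the ASCII sample, UTF-8 is the smallest at one byte per character; for the Japanese sample, UTF-16 needs two bytes per character while UTF-8 needs three.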

What is the proper way to URL encode Unicode characters?

I would always encode in UTF-8. From the Wikipedia page on percent encoding: The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, … Read more
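
As an illustration of percent-encoding over UTF-8, Python's standard urllib.parse.quote converts non-ASCII characters to their UTF-8 bytes and then escapes each byte. The sample string is an assumption.

```python
from urllib.parse import quote, unquote

original = "naïve café"

# quote() uses UTF-8 by default; each non-ASCII character becomes
# one %XX escape per UTF-8 byte, and the space becomes %20.
encoded = quote(original)

# unquote() reverses the process, also assuming UTF-8.
decoded = unquote(encoded)

print(encoded)
print(decoded == original)
```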

Using awk to remove the Byte-order mark

Using GNU sed (on Linux or Cygwin): # Removing BOM from all text files in current directory: sed -i '1 s/^\xef\xbb\xbf//' *.txt On FreeBSD: sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt Advantage of using GNU or FreeBSD sed: the -i parameter means “in place”, and will update files without the need for redirections or weird tricks. … Read more
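
Where sed is not available, the same three bytes can be stripped in a few lines of Python. This is a sketch under stated assumptions rather than the answer's method: the helper name strip_bom is hypothetical, and it works on bytes already read into memory.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte-order mark (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"first line\n"
cleaned = strip_bom(with_bom)

# When decoding to text anyway, the "utf-8-sig" codec drops the BOM itself.
as_text = with_bom.decode("utf-8-sig")

print(cleaned, repr(as_text))
```

Like the sed expression, this only touches a mark at the very start of the data and leaves files without one unchanged.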

What are Unicode, UTF-8, and UTF-16?

Why do we need Unicode? In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today’s strange world of global intercommunication and social media was not foreseen, and … Read more