ASCII vs Unicode + UTF-8
In practice, ASCII is now treated as a subset of UTF-8 rather than as a separate scheme: any valid ASCII text is also valid UTF-8 with exactly the same bytes, so UTF-8 is backwards compatible with ASCII.
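As a quick check of the compatibility claim, here is a minimal Python sketch (the sample strings are arbitrary choices, not from the original answer). It encodes the same text as ASCII and as UTF-8 and compares the resulting bytes.

```python
# Pure-ASCII text encodes to the same bytes under both encodings.
text = "Hello, ASCII!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Every byte of ASCII text fits in 7 bits (high bit clear).
all_seven_bit = all(b < 0x80 for b in utf8_bytes)

# A non-ASCII character becomes a multi-byte sequence in UTF-8,
# and every byte of that sequence has the high bit set.
e_acute = "\u00e9".encode("utf-8")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
high_bit_set = all(b >= 0x80 for b in e_acute)

print(same_bytes, all_seven_bit, e_acute, high_bit_set)
```

Because non-ASCII characters never produce bytes below 0x80, software that only understands ASCII can still locate ASCII delimiters in UTF-8 text without misreading part of a multi-byte sequence.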
Check out the CharsetEncoder and CharsetDecoder API descriptions: you should follow a specific sequence of method calls to avoid this problem. For example, for CharsetEncoder: reset the encoder via the reset method, unless it has not been used before; invoke the encode method zero or more times, as long as additional input may be …
ASCII was indeed originally conceived as a 7-bit code. This was done well before 8-bit bytes became ubiquitous, and even into the 1990s you could find software that assumed it could use the 8th bit of each byte of text for its own purposes (“not 8-bit clean”). Nowadays people think of it as an 8-bit …
This can happen if you have a newline (or other control character) in a JSON string literal:
    {"foo": "bar
    baz"}
If you are the one producing the data, replace actual newlines with escaped ones (written as "\\n" in most programming-language string literals) when creating your JSON strings, so that the JSON text itself contains the two characters \n:
    {"foo": "bar\nbaz"}
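To illustrate the point above, here is a small Python sketch using the standard json module, which is strict about control characters by default. The variable names are made up for the example.

```python
import json

# JSON text with a raw newline character inside the string literal.
raw = '{"foo": "bar\nbaz"}'
try:
    json.loads(raw)
    raw_parses = True
except json.JSONDecodeError:
    raw_parses = False  # strict parsers reject the unescaped control character

# Letting a JSON serializer build the text escapes the newline for you.
escaped = json.dumps({"foo": "bar\nbaz"})
has_raw_newline = "\n" in escaped          # no real newline in the JSON text
round_tripped = json.loads(escaped)["foo"]  # parsing restores the newline

print(raw_parses, escaped, has_raw_newline)
```

Producing JSON through a serializer rather than by string concatenation avoids this class of error, since quotes, backslashes, and other control characters are escaped as well.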
Seeing ?????? instead of intelligible characters (and even instead of Mojibake) usually indicates that the component responsible for the data transfer is itself well aware of the encoding used in both the source and the destination. In the average web application there are only 2 places where this is the case: the point when the data …
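The difference between question marks and Mojibake can be reproduced in a few lines of Python. This is only a sketch of one plausible mechanism (an encoder that knows the target charset and substitutes "?" for characters it cannot represent); the actual cause depends on the stack involved, and the sample strings are arbitrary.

```python
# Case 1: the converter knows both encodings. Cyrillic text has no
# representation in ISO-8859-1, so each character is replaced with "?".
original = "привет"
question_marks = original.encode("latin-1", errors="replace").decode("latin-1")

# Case 2: Mojibake. The bytes are fine, but they are decoded with the
# wrong charset (UTF-8 bytes read as ISO-8859-1).
mojibake = "\u00e9".encode("utf-8").decode("latin-1")

print(question_marks)  # one "?" per unrepresentable character
print(mojibake)        # two wrong characters for one original character
```

Note that the first case loses information for good, while Mojibake can often be reversed by re-encoding with the wrong charset and decoding with the right one.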
You can use mb_convert_encoding() or htmlspecialchars()'s ENT_SUBSTITUTE option since PHP 5.4. Of course, you can use preg_match() too. If you use intl, you can use UConverter since PHP 5.5. The recommended substitute character for an invalid byte sequence is U+FFFD. See “3.1.2 Substituting for Ill-Formed Subsequences” in UTR #36: Unicode Security Considerations for the details. When using …
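The substitution behaviour itself is not specific to PHP. As a language-neutral illustration (this is Python's built-in decoder, not any of the PHP functions named above), the following sketch decodes a byte string containing one invalid byte and shows it replaced by U+FFFD.

```python
# 0xFF can never appear in well-formed UTF-8.
data = b"abc\xffdef"

# Strict decoding (the default) rejects the input outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD for the ill-formed byte.
substituted = data.decode("utf-8", errors="replace")
count = substituted.count("\ufffd")

print(strict_ok, repr(substituted), count)
```

Substituting rather than silently dropping the bad byte keeps the damage visible and avoids joining the text on either side of it, which is the concern UTR #36 raises.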
You have to make sure the content is served with the proper character set: Either send the content with a header that includes <?php header("Content-Type: text/html; charset=[your charset]"); ?> or, if the HTTP charset headers don't exist, insert a <META> element into the <head>: <meta http-equiv="Content-Type" content="text/html; charset=[your charset]" /> Like the attribute …
The number of ways this can get screwed up is actually quite impressive. If you're using MySQL, try adding a characterEncoding=UTF-8 parameter to the end of your JDBC connection URL:
    jdbc:mysql://server/database?characterEncoding=UTF-8
You should also check that the table / column character set is UTF-8.
Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including EURO, which ISO-8859-1 does not have), but also works correctly for the UTF-8->ISO-8859-1 conversion part of an ISO-8859-1->UTF-8->ISO-8859-1 round-trip. The function ignores invalid code points, similarly to the //IGNORE flag for iconv, but does not recompose decomposed UTF-8 sequences; that is, it …
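The properties described above can be explored with Python's built-in codecs. This is a sketch of the same behaviours, not the utf8_to_latin9() function itself, and the sample strings are assumptions chosen for illustration.

```python
import unicodedata

# ISO-8859-15 has the euro sign (at 0xA4); ISO-8859-1 does not.
euro_latin9 = "\u20ac".encode("iso-8859-15")
try:
    "\u20ac".encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# errors="ignore" drops unrepresentable characters (here U+2603 SNOWMAN),
# roughly like iconv's //IGNORE.
ignored = "a\u20acb\u2603".encode("iso-8859-15", errors="ignore")

# A decomposed sequence (e + U+0301 COMBINING ACUTE ACCENT) is not
# recomposed by the codec: the accent is simply dropped.
decomposed = "e\u0301"
not_recomposed = decomposed.encode("iso-8859-15", errors="ignore")

# Normalizing to NFC first recomposes it into a single encodable character.
recomposed = unicodedata.normalize("NFC", decomposed).encode("iso-8859-15")

# ISO-8859-1 -> UTF-8 -> ISO-8859-1 round-trip is lossless.
latin1_bytes = b"caf\xe9"
round_trip = latin1_bytes.decode("iso-8859-1").encode("utf-8").decode("utf-8").encode("iso-8859-1")

print(euro_latin9, latin1_has_euro, ignored, not_recomposed, recomposed)
```

If the input may contain decomposed sequences, normalizing to NFC before the conversion preserves more of the text than ignoring the combining marks.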
You need to distinguish between the source character set, the execution character set, the wide execution character set, and their basic versions: The basic source character set: §2.1.1: The basic source character set consists of 96 characters […] This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not …