character-encoding - w3toppers.com

ASCII vs Unicode + UTF-8

ASCII is effectively a subset of UTF-8: every ASCII character is encoded in UTF-8 as the same single byte. UTF-8 is therefore backwards compatible with ASCII, and any valid ASCII text is also valid UTF-8.
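The subset relationship can be checked directly. The following is a minimal Python sketch (the variable names are illustrative, not from the original answer) that encodes all 128 ASCII code points both ways and compares the bytes:

```python
# Build a string containing every ASCII code point (0..127).
ascii_text = "".join(chr(cp) for cp in range(128))

# Encode it once as ASCII and once as UTF-8.
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# For pure ASCII input the two encodings produce identical bytes.
same_bytes = ascii_bytes == utf8_bytes

# A non-ASCII character needs more than one byte in UTF-8
# and cannot be encoded as ASCII at all.
euro_utf8 = "€".encode("utf-8")

print(same_bytes, len(utf8_bytes), euro_utf8)
```

The reverse does not hold: UTF-8 text that contains non-ASCII characters is not valid ASCII.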

Java: Converting String to and from ByteBuffer and associated problems

Check out the CharsetEncoder and CharsetDecoder API descriptions: you should follow a specific sequence of method calls to avoid this problem. For example, for CharsetEncoder: reset the encoder via the reset method, unless it has not been used before; invoke the encode method zero or more times, as long as additional input may be …

Is ASCII code in matter of fact 7 bit or 8 bit?

ASCII was indeed originally conceived as a 7-bit code. This was done well before 8-bit bytes became ubiquitous, and even into the 1990s you could find software that assumed it could use the 8th bit of each byte of text for its own purposes (“not 8-bit clean”). Nowadays people think of it as an 8-bit …
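To make the 7-bit point concrete, here is a small Python sketch. The helper name `is_seven_bit` is hypothetical; it simply tests whether the 8th (highest) bit of every byte is clear, which is true for any ASCII text stored one character per byte:

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has its 8th (highest) bit set."""
    return all((b & 0x80) == 0 for b in data)

# Plain ASCII text only ever uses values 0..127.
plain = "Hello, world".encode("ascii")

# ISO-8859-1 uses the 8th bit for characters such as 'é' (0xE9).
accented = "café".encode("iso-8859-1")

print(is_seven_bit(plain))     # the top bit is never set
print(is_seven_bit(accented))  # 0xE9 has the top bit set
```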

JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10)

This can happen if you have a newline (or other control character) in a JSON string literal, for example {"foo": "bar baz"} with a literal line break between bar and baz. If you are the one producing the data, replace actual newlines with escaped ones ("\\n") when creating your string literals, so that the JSON text reads {"foo": "bar\nbaz"}.
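The same rule can be demonstrated outside of Jackson. The sketch below uses Python's standard `json` module as a stand-in (an assumption: the original question concerns Jackson's JsonParseException, and the variable names here are illustrative). It shows a raw newline being rejected and a serializer producing the escaped form:

```python
import json

# JSON text whose string value contains an actual newline character (code 10).
raw = '{"foo": "bar\nbaz"}'

try:
    json.loads(raw)
    parsed_ok = True
except json.JSONDecodeError:
    # Unescaped control characters are not allowed inside JSON strings.
    parsed_ok = False

# Letting a JSON library serialize the value escapes the newline as \n.
encoded = json.dumps({"foo": "bar\nbaz"})
decoded = json.loads(encoded)

print(parsed_ok)
print(encoded)
```

Building JSON with a serializer rather than by string concatenation avoids having to escape control characters by hand.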

Unicode characters in servlet application are shown as question marks

Seeing ?????? instead of intelligible characters (and even instead of Mojibake) usually indicates that the component responsible for the data transfer is itself well aware of the encoding used in both the source and the destination. In the average web application there are only 2 places where this is the case: the point when the data …
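The difference between question marks and Mojibake can be reproduced in a few lines. This is a hedged Python sketch rather than servlet code: it assumes an encoder that knows the target charset and substitutes `?` for characters it cannot represent, and contrasts that with bytes being decoded using the wrong charset:

```python
# Case 1: the encoder knows the target charset (ISO-8859-1) and sees that
# it cannot represent these characters, so it substitutes '?' for each one.
text = "日本語"
question_marks = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

# Case 2: the bytes are valid UTF-8 but are decoded with the wrong charset.
# Nothing is lost, the bytes are just misinterpreted (Mojibake).
mojibake = "é".encode("utf-8").decode("iso-8859-1")

print(question_marks)
print(mojibake)
```

Question marks mean information was already discarded at encoding time, whereas Mojibake can often be reversed by decoding again with the right charset.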

Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored

You can use mb_convert_encoding() or the ENT_SUBSTITUTE option of htmlspecialchars() since PHP 5.4. Of course you can use preg_match() too. If you use intl, you can use UConverter since PHP 5.5. The recommended substitute character for an invalid byte sequence is U+FFFD. See “3.1.2 Substituting for Ill-Formed Subsequences” in UTR #36: Unicode Security Considerations for the details. When using …
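As an illustration of the substitution behaviour (in Python rather than PHP, and not a replacement for the PHP functions named above), the sketch below decodes a byte string containing an invalid UTF-8 byte. The helper `replace_invalid_utf8` and its `substitute` parameter are hypothetical names:

```python
def replace_invalid_utf8(data: bytes, substitute: str = "\ufffd") -> str:
    """Decode UTF-8, substituting a marker for ill-formed byte sequences."""
    # errors="replace" inserts U+FFFD for each ill-formed subsequence.
    text = data.decode("utf-8", errors="replace")
    if substitute == "\ufffd":
        return text
    # Caveat: this also rewrites any U+FFFD that was validly encoded
    # in the input, so '?' substitution is lossy in that respect.
    return text.replace("\ufffd", substitute)

broken = b"abc\xffdef"  # 0xFF can never appear in well-formed UTF-8

with_fffd = replace_invalid_utf8(broken)
with_question_mark = replace_invalid_utf8(broken, "?")

print(with_fffd)
print(with_question_mark)
```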

Browser displays � instead of ´

You have to make sure the content is served with the proper character set: Either send the content with a header that includes <?php header("Content-Type: text/html; charset=[your charset]"); ?> or – if the HTTP charset headers don’t exist – insert a <META> element into the <head>: <meta http-equiv="Content-Type" content="text/html; charset=[your charset]" /> Like the attribute …
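Why the declared charset matters can be sketched in Python. This assumes the page bytes are ISO-8859-1 and approximates a browser that decodes them as UTF-8 by using Python's decoder with replacement; the variable names are illustrative:

```python
acute = "´"  # U+00B4 ACUTE ACCENT

# The page is stored as ISO-8859-1, where this character is the single byte 0xB4.
latin1_bytes = acute.encode("iso-8859-1")

# If the client assumes UTF-8, a lone 0xB4 is ill-formed and shows up as U+FFFD.
shown_without_charset = latin1_bytes.decode("utf-8", errors="replace")

# Declaring the charset that matches the bytes gives the intended character.
charset = "ISO-8859-1"
content_type = f"Content-Type: text/html; charset={charset}"
shown_with_charset = latin1_bytes.decode(charset)

print(content_type)
print(shown_without_charset, shown_with_charset)
```

The declared charset has to match the bytes actually sent; declaring UTF-8 for a file saved as ISO-8859-1 reproduces the problem.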

Java PreparedStatement UTF-8 character problem

The number of ways this can get screwed up is actually quite impressive. If you’re using MySQL, try adding a characterEncoding=UTF-8 parameter to the end of your JDBC connection URL: jdbc:mysql://server/database?characterEncoding=UTF-8 You should also check that the table / column character set is UTF-8.

Is there a way to convert from UTF8 to ISO-8859-1?

Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including EURO, which ISO-8859-1 does not have), but also works correctly for the UTF-8->ISO-8859-1 conversion part of an ISO-8859-1->UTF-8->ISO-8859-1 round-trip. The function ignores invalid code points, similar to the //IGNORE flag for iconv, but does not recompose decomposed UTF-8 sequences; that is, it …
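The conversion described above can be approximated with Python's built-in codecs. This is not the utf8_to_latin9() function from the answer, only a sketch of similar behaviour; the helper name `utf8_to_latin9_sketch` and its `target` parameter are assumptions for illustration:

```python
def utf8_to_latin9_sketch(data: bytes, target: str = "iso-8859-15") -> bytes:
    """Convert UTF-8 bytes to a single-byte charset, dropping what does not fit."""
    # Skip ill-formed UTF-8 sequences instead of raising.
    text = data.decode("utf-8", errors="ignore")
    # Skip characters the target charset cannot represent,
    # roughly like the //IGNORE flag of iconv.
    return text.encode(target, errors="ignore")

# The euro sign exists in ISO-8859-15 (byte 0xA4) but not in ISO-8859-1.
source = "€100 ☃".encode("utf-8")
as_latin9 = utf8_to_latin9_sketch(source)
as_latin1 = utf8_to_latin9_sketch(source, target="iso-8859-1")

# Decomposed sequences are not recomposed: 'e' + COMBINING ACUTE ACCENT
# loses the accent because the combining mark has no single-byte form.
decomposed = utf8_to_latin9_sketch("e\u0301".encode("utf-8"))

print(as_latin9, as_latin1, decomposed)
```

If decomposed input is expected, normalizing the text to NFC first (for example with `unicodedata.normalize`) would let sequences such as e + combining acute map to a single byte.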

Character sets – Not clear

You need to distinguish between the source character set, the execution character set, the wide execution character set and their basic versions: The basic source character set: §2.1.1: The basic source character set consists of 96 characters […] This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not …