Java: Converting String to and from ByteBuffer and associated problems

Check out the CharsetEncoder and CharsetDecoder API descriptions: you should follow a specific sequence of method calls to avoid this problem. For example, for CharsetEncoder: reset the encoder via the reset method, unless it has not been used before; invoke the encode method zero or more times, as long as additional input may be …
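
As a rough illustration of that call sequence, here is a minimal Java sketch. It assumes the whole input fits in one buffer, so encode and decode are each called once with endOfInput set to true, followed by flush, as the CharsetEncoder and CharsetDecoder Javadoc describes. The class name BufferCodec and its two helper methods are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: reuses one encoder/decoder safely across calls.
final class BufferCodec {

    static ByteBuffer encode(CharsetEncoder encoder, String text) {
        CharBuffer in = CharBuffer.wrap(text);
        // Worst-case output size for this charset.
        int capacity = (int) Math.ceil(text.length() * (double) encoder.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);

        encoder.reset();                                     // 1. reset (harmless on a fresh encoder)
        CoderResult result = encoder.encode(in, out, true);  // 2. final encode call, endOfInput = true
        if (!result.isUnderflow()) {
            throw new IllegalArgumentException("encoding failed: " + result);
        }
        result = encoder.flush(out);                         // 3. flush internal state
        if (!result.isUnderflow()) {
            throw new IllegalArgumentException("flush failed: " + result);
        }
        out.flip();                                          // make the written bytes readable
        return out;
    }

    static String decode(CharsetDecoder decoder, ByteBuffer bytes) {
        int capacity = (int) Math.ceil(bytes.remaining() * (double) decoder.maxCharsPerByte());
        CharBuffer out = CharBuffer.allocate(capacity);

        decoder.reset();                                     // same sequence for the decoder
        CoderResult result = decoder.decode(bytes, out, true);
        if (!result.isUnderflow()) {
            throw new IllegalArgumentException("decoding failed: " + result);
        }
        result = decoder.flush(out);
        if (!result.isUnderflow()) {
            throw new IllegalArgumentException("flush failed: " + result);
        }
        out.flip();
        return out.toString();
    }
}
```

Because each call begins with reset, the same encoder or decoder instance (for example from StandardCharsets.UTF_8.newEncoder()) can be reused for several strings without leftover state from an earlier call.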

Unicode characters in servlet application are shown as question marks

Seeing ?????? instead of intelligible characters (or even instead of Mojibake) usually indicates that the component responsible for the data transfer is itself well aware of the encoding used in both the source and the destination. In the average web application there are only two places where this is the case: the point when the data …
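
The difference between question marks and Mojibake can be reproduced in a few lines of Java. This is only a sketch of the general effect, not of any specific servlet or database setup. The class and method names are made up for the example. A charset-aware conversion that cannot represent a character substitutes '?', while a conversion that reinterprets the bytes under the wrong charset yields Mojibake.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class contrasting the two failure modes.
final class QuestionMarkDemo {

    // The encoder knows the target charset (ISO-8859-1) cannot hold these
    // characters, so String.getBytes substitutes the replacement byte '?'.
    static String lossyRoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // The bytes are valid UTF-8 but are decoded under the wrong charset,
    // so every byte turns into some unrelated character (Mojibake).
    static String mojibake(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Passing a six-letter Cyrillic word to lossyRoundTrip returns six question marks, whereas mojibake("\u00e9") returns the two characters U+00C3 and U+00A9.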

Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored

You can use mb_convert_encoding() or htmlspecialchars()'s ENT_SUBSTITUTE option since PHP 5.4. Of course you can use preg_match() too. If you use intl, you can use UConverter since PHP 5.5. The recommended substitute character for an invalid byte sequence is U+FFFD; see “3.1.2 Substituting for Ill-Formed Subsequences” in UTR #36: Unicode Security Considerations for the details. When using …
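
The excerpt above is about PHP. For comparison, the same substitution idea can be sketched in Java, where a CharsetDecoder can be told to replace malformed input with U+FFFD instead of reporting an error. The class name Utf8Sanitizer and its method are hypothetical, and this is an illustration of the technique rather than a port of any of the PHP functions mentioned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8, substituting U+FFFD for bad bytes.
final class Utf8Sanitizer {

    static String decodeReplacingInvalid(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)       // ill-formed byte sequences
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("\uFFFD");                            // the recommended substitute
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but the method declares it.
            throw new IllegalStateException(e);
        }
    }
}
```

For instance, the byte 0xFF can never occur in well-formed UTF-8, so a lone 0xFF between two ASCII letters comes back as a single U+FFFD between them.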

Browser displays � instead of ´

You have to make sure the content is served with the proper character set: either send the content with a header that declares the charset, <?php header("Content-Type: text/html; charset=[your charset]"); ?>, or, if the HTTP charset header doesn’t exist, insert a <META> element into the <head>: <meta http-equiv="Content-Type" content="text/html; charset=[your charset]" /> Like the attribute …

Is there a way to convert from UTF8 to ISO-8859-1?

Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including the euro sign, which ISO-8859-1 does not have), but also works correctly for the UTF-8->ISO-8859-1 conversion part of an ISO-8859-1->UTF-8->ISO-8859-1 round-trip. The function ignores invalid code points, similar to the //IGNORE flag for iconv, but does not recompose decomposed UTF-8 sequences; that is, it …
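
The utf8_to_latin9() function itself is not shown in this excerpt. As a loose Java analogue of the "ignore what cannot be converted" behaviour, the sketch below decodes UTF-8 bytes and re-encodes them as ISO-8859-1, silently dropping characters that have no ISO-8859-1 equivalent. It targets ISO-8859-1 rather than ISO-8859-15, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: UTF-8 bytes in, ISO-8859-1 bytes out.
final class Latin1Converter {

    static byte[] utf8ToLatin1(byte[] utf8) {
        // Invalid UTF-8 becomes U+FFFD here, which is then dropped below
        // because ISO-8859-1 cannot represent it.
        String text = new String(utf8, StandardCharsets.UTF_8);

        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE); // skip, do not emit '?'
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] result = new byte[out.remaining()];
            out.get(result);
            return result;
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but the method declares it.
            throw new IllegalStateException(e);
        }
    }
}
```

Like the function described above, this does not recompose decomposed sequences: an "e" followed by a combining acute accent (U+0301) comes out as a plain "e", because the combining mark on its own has no ISO-8859-1 equivalent.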

Character sets – Not clear

You need to distinguish between the source character set, the execution character set, the wide execution character set and their basic versions: The basic source character set: §2.1.1: The basic source character set consists of 96 characters […] This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not …