Why does modern Perl avoid UTF-8 by default?

Simplest ℞: 7 Discrete Recommendations

Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both these are global effects, not lexical ones. At the top of your source file (program, module, library, dohickey), prominently …

Saving utf-8 texts with json.dumps as UTF8, not as \u escape sequence

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

    >>> import json
    >>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
    >>> json_string
    b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
    >>> print(json_string.decode())
    "ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

    with open('filename', 'w', encoding='utf8') as json_file:
        json.dump("ברי צקלה", json_file, …
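The file-writing advice can be tried end to end with a small sketch. The helper names dump_utf8 and read_raw, the temporary file name, and the sample string are assumptions made for illustration; only json.dump(), ensure_ascii=False, and the encoding argument to open() come from the answer.

```python
import json
import os
import tempfile

text = "ברי צקלה"  # sample value, chosen for this sketch


def dump_utf8(value, path):
    """Write value as JSON, letting the file object encode it as UTF-8."""
    with open(path, "w", encoding="utf8") as json_file:
        # ensure_ascii=False keeps non-ASCII characters as they are
        json.dump(value, json_file, ensure_ascii=False)


def read_raw(path):
    """Return the bytes that actually ended up on disk."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")  # hypothetical file name
    dump_utf8(text, path)
    raw = read_raw(path)

# For comparison: the default (ensure_ascii=True) escapes every non-ASCII character.
escaped = json.dumps(text)

print(raw)
print(escaped)
```

The first line printed shows the raw UTF-8 bytes of the Hebrew text between the JSON quotes; the second shows the escaped form that the default settings would have produced.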

How to get UTF-8 working in Java webapps?

Answering myself as the FAQ of this site encourages it. This works for me:

Mostly, the characters äåö are not problematic, as the default character set used by browsers and Tomcat/Java for webapps is Latin-1, i.e. ISO-8859-1, which “understands” those characters. Getting UTF-8 working under Java+Tomcat+Linux/Windows+MySQL requires the following:

Configuring Tomcat’s server.xml

It’s necessary …
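The point about Latin-1 "understanding" äåö can be checked outside of Tomcat. The following is only an illustration in Python, not part of the Java setup the answer describes; the helper fits_latin1 and the sample strings are assumptions made for this sketch.

```python
# Illustration only: sample strings chosen for this sketch.
nordic = "äåö"

latin1_bytes = nordic.encode("latin-1")  # one byte per character
utf8_bytes = nordic.encode("utf-8")      # two bytes per character


def fits_latin1(text):
    """Return True if every character in text exists in ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(len(latin1_bytes), len(utf8_bytes))       # 3 6
print(fits_latin1(nordic), fits_latin1("€"))    # True False
```

Characters such as äåö survive a Latin-1 default, which is why problems tend to surface only once text outside that range appears.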

Setting the default Java character encoding

Unfortunately, the file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding used by String.getBytes() and the default constructors of InputStreamReader and OutputStreamWriter has been permanently cached. As Edward Grech points out, in a special case like this, the environment variable JAVA_TOOL_OPTIONS can …

Trouble with UTF-8 characters; what I see is not what I stored

This problem plagues the participants of this site, and many others. You have listed the five main cases of CHARACTER SET troubles.

Best Practice

Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.) utf8mb4 is a superset of utf8 …
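The five cases themselves are cut off in this excerpt, so the sketch below does not reproduce any of them. It only shows, in Python rather than MySQL, one generic way that "what I see is not what I stored" can arise: text is written as UTF-8 bytes and later read as if it were Latin-1. The variable names and the sample string are assumptions.

```python
stored = "äåö"                       # what the application meant to store

wire_bytes = stored.encode("utf-8")  # bytes as written by a UTF-8 client
seen = wire_bytes.decode("latin-1")  # same bytes, misread as Latin-1

# If nothing else has altered the bytes, reversing the wrong step recovers the text.
recovered = seen.encode("latin-1").decode("utf-8")

print(seen)       # Ã¤Ã¥Ã¶
print(recovered)  # äåö
```

Each accented letter turns into two unrelated characters, a typical sign that the declared and actual encodings disagree somewhere along the path.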

UTF-8 all the way through

Data Storage: Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set). In older versions of MySQL (< 5.5.3), you’ll …
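A rough way to see why the 4-byte character set matters is to measure how many bytes each character needs in UTF-8. The sketch below assumes that MySQL's older utf8 character set stores at most three bytes per character, which the excerpt does not state; the helpers utf8_width and needs_utf8mb4 are hypothetical names, and the check runs in Python, not in the database.

```python
def utf8_width(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """True if any character needs four UTF-8 bytes (outside the Basic Multilingual Plane)."""
    return any(utf8_width(ch) == 4 for ch in text)


for sample in ["a", "ä", "€", "😀"]:
    print(repr(sample), utf8_width(sample))

print(needs_utf8mb4("äåö €"))   # False
print(needs_utf8mb4("hi 😀"))   # True
```

Text for which needs_utf8mb4 returns True is the kind of data that, under the assumption above, would not fit a 3-byte character set.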