How to detect invalid UTF-8 unicode/binary in a text file

Assuming your locale is set to UTF-8 (check the output of locale), this works well for recognizing invalid UTF-8 sequences: grep -axv '.*' file.txt Explanation (from the grep man page): -a, --text: treats the file as text; this essentially prevents grep from aborting once it finds an invalid byte sequence (one that is not valid UTF-8). -v, --invert-match: inverts the output, showing lines …
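For cases where grep or a UTF-8 locale is not available, the same check can be sketched in Python. This is only an illustration of the idea, not part of the original answer: the function name find_invalid_utf8_lines is hypothetical, and it relies on Python's strict UTF-8 decoder instead of grep's locale handling.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8."""
    bad = []
    # Open in binary mode so that decoding errors are ours to catch,
    # rather than being raised (or hidden) by the file object.
    with open(path, "rb") as fh:
        for number, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on invalid bytes
            except UnicodeDecodeError:
                bad.append((number, raw))
    return bad


if __name__ == "__main__":
    import sys

    for number, raw in find_invalid_utf8_lines(sys.argv[1]):
        print(f"{number}: {raw!r}")
```

Splitting on newline bytes before decoding is safe here because the byte 0x0A never appears inside a multi-byte UTF-8 sequence.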

GCC 4.7 Source Character Encoding and Execution Character Encoding For String Literals?

I don't know how well these options actually work (I'm not using them at the moment; I still prefer treating string literals as 'ASCII only', since localized strings come from external files anyway, so it's mostly things like format strings or filenames), but they have added options like -fexec-charset=charset: set the execution character set, used for string and …

PHP generated XML shows invalid Char value 27 message

A useful function to get rid of that error is suggested on this website: http://www.phpwact.org/php/i18n/charsets#common_problem_areas_with_utf-8 When you put UTF-8 encoded strings in an XML document, remember that not all valid UTF-8 characters are accepted in an XML document: http://www.w3.org/TR/REC-xml/#charsets So you should strip away the unwanted characters, or else you'll have an XML fatal …
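The stripping step can be sketched as follows. This is not the function from the linked page; it is a minimal Python illustration that assumes XML 1.0 and removes everything outside the Char production of the W3C specification. The name strip_invalid_xml_chars is hypothetical.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Char value 27 is ESC (0x1B), which is not allowed in XML 1.0.
    print(repr(strip_invalid_xml_chars("before\x1bafter")))
```

Tab, line feed, and carriage return are kept because the specification allows them; other control characters such as ESC are dropped.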

Weird characters in URL

They are essentially malformed URLs. They can be generated by malware that is trying to exploit website vulnerabilities, by a malfunctioning browser plugin or extension, or by a bug in a JS file (e.g. tracking with Google Analytics) in combination with a specific browser version or operating system. In any case, you can't actually control …

Powershell ConvertFrom-Json Encoding Special Characters Issue

Peter Schneider's helpful answer and Nas' helpful answer both address one problem with your approach. You need to either access the .Content property on the response object returned by Invoke-WebRequest to get the actual data returned (as a JSON string), which you can then pass to ConvertFrom-Json, or use Invoke-RestMethod instead, which returns the data …

How can I open files containing accents in Java?

First, the character encoding used is not directly related to the locale, so changing the locale won't help much. Second, the ï¿½ is typical of the Unicode replacement character U+FFFD (�) being printed as ISO-8859-1 instead of UTF-8. Here's the evidence: System.out.println(new String("�".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½ So there are two problems: Your JVM is reading …
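The same round trip can be reproduced outside Java. The Python sketch below is only an illustration of the mechanism described above (the helper name misdecode is hypothetical): U+FFFD is encoded as the three UTF-8 bytes EF BF BD, and reading those bytes as ISO-8859-1 yields three separate characters.

```python
def misdecode(text):
    """Encode text as UTF-8, then wrongly decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    replacement = "\ufffd"  # Unicode replacement character
    print(replacement.encode("utf-8"))  # b'\xef\xbf\xbd'
    print(misdecode(replacement))       # ï¿½
```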