how to detect invalid utf8 unicode/binary in a text file

Assuming you have your locale set to UTF-8 (see locale output), this works well to recognize invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from grep man page):

  • -a, –text: treats file as text, essential prevents grep to abort once finding an invalid byte sequence (not being utf8)
  • -v, –invert-match: inverts the output showing lines not matched
  • -x ‘.*’ (–line-regexp): means to match a complete line consisting of any utf8 character.

Hence, there will be output, which is the lines containing the invalid not utf8 byte sequence containing lines (since inverted -v)

Leave a Comment