Force character vector encoding from “unknown” to “UTF-8” in R

The Encoding function returns "unknown" if a character string carries a "native encoding" mark (CP-1250 in your case) or if it consists only of ASCII characters.
To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)
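For instance, on a small hypothetical vector (the column name above is from your data; here we use made-up strings), one pure-ASCII and one created with UTF-8 escapes:

```r
library(stringi)

# "abc" is plain ASCII; the second string is "zażółć" built from
# Unicode escapes, so R marks it as UTF-8
x <- c("abc", "za\u017c\u00f3\u0142\u0107")
stri_enc_mark(x)
## [1] "ASCII" "UTF-8"
```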

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))
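To see why this test is useful, here is a sketch with a byte sequence that is valid CP-1250 but not valid UTF-8 (the byte values are illustrative):

```r
library(stringi)

# 0xBF is "ż" in CP-1250, but in UTF-8 it is a lone continuation
# byte, which makes the sequence invalid
bad <- rawToChar(as.raw(c(0x7a, 0xbf)))  # "z" followed by byte 0xBF
stri_enc_isutf8(bad)
## [1] FALSE
```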

If that is not the case, your file is definitely not in UTF-8.

I suspect that you haven't forced UTF-8 mode in the data-read function (inspect the contents of poli.dt$word to verify this). If my guess is correct, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
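If you know the bytes are actually CP-1250 rather than whatever the native encoding happens to be, you can also name the source encoding explicitly in stri_encode (a sketch with illustrative bytes, not your data):

```r
library(stringi)

# Bytes 0xBF 0xF3 are "żó" in CP-1250; decode them explicitly
x <- rawToChar(as.raw(c(0xbf, 0xf3)))
y <- stri_encode(x, "windows-1250", "UTF-8")
y
## [1] "żó"
```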

If data.table still complains about the “mixed” encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
