Detect text language in R

The textcat package does this. It can detect 74 ‘languages’ (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.

Here’s the abstract:

Identifying the language used will typically be the first step in most
natural language processing tasks. Among the wide variety of language
identification methods discussed in the literature, the ones employing
the Cavnar and Trenkle (1994) approach to text categorization based on
character n-gram frequencies have been particularly successful. This
paper presents the R extension package textcat for n-gram based text
categorization which implements both the Cavnar and Trenkle approach
as well as a reduced n-gram approach designed to remove redundancies
of the original approach. A multi-lingual corpus obtained from the
Wikipedia pages available on a selection of topics is used to
illustrate the functionality of the package and the performance of the
provided language identification methods.
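To make the idea in the abstract concrete, here is a rough base-R sketch of rank-order n-gram categorization in the spirit of Cavnar and Trenkle. It is not the package's code: the function names (ngram_profile, out_of_place, detect_language), the two toy training strings, and the profile size of 300 are all assumptions made for illustration, and two short strings are far too little text for reliable profiles.

```r
# Toy sketch of Cavnar-Trenkle style language identification (not textcat's code).

# Rank the character n-grams (n = 1..max_n) of a text by decreasing frequency.
ngram_profile <- function(text, max_n = 5, size = 300) {
  words <- strsplit(gsub("[^[:alpha:]]+", " ", tolower(text)), " +")[[1]]
  words <- paste0("_", words[nzchar(words)], "_")   # pad words with "_"
  grams <- unlist(lapply(words, function(w) {
    chars <- strsplit(w, "")[[1]]
    unlist(lapply(seq_len(max_n), function(n) {
      if (length(chars) < n) return(character(0))
      starts <- seq_len(length(chars) - n + 1)
      vapply(starts, function(i) paste(chars[i:(i + n - 1)], collapse = ""),
             character(1))
    }))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  head(names(counts), size)                         # keep the top-ranked n-grams
}

# "Out-of-place" distance: sum of rank differences, with a maximum
# penalty for n-grams that do not occur in the language profile.
out_of_place <- function(doc, lang) {
  pos <- match(doc, lang)
  sum(ifelse(is.na(pos), length(lang), abs(seq_along(doc) - pos)))
}

# Pick the language whose profile is closest to the document profile.
detect_language <- function(text, profiles) {
  doc <- ngram_profile(text)
  d <- vapply(profiles, function(p) out_of_place(doc, p), numeric(1))
  names(which.min(d))
}

# Hypothetical, deliberately tiny training data (ASCII only).
training <- c(
  english = "the quick brown fox jumps over the lazy dog and this is what the people say when they think that there is something with which they have to work",
  german  = "der schnelle braune fuchs springt ueber den faulen hund und das ist was die leute sagen wenn sie denken dass es etwas gibt mit dem sie arbeiten muessen"
)
profiles <- lapply(training, ngram_profile)

detect_language("This is an English sentence.", profiles)
detect_language("Das ist ein deutscher Satz.", profiles)
```

With realistic amounts of training text per language, the same three steps (profile, distance, nearest profile) are what an n-gram categorizer of this kind does; textcat ships ready-made profiles so you do not have to build them yourself.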

And here’s one of their examples:

library("textcat")
textcat(c(
  "This is an English sentence.",
  "Das ist ein deutscher Satz.",
  "Esta es una frase en español."))
[1] "english" "german" "spanish" 
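If the default profiles do not suit your data, the package also lets you pass a different profile database through the p argument. The sketch below assumes the profile sets TC_char_profiles and ECIMCI_profiles that ship with textcat, and builds a toy database with textcat_profile_db(); the two-sentence "training set" and the labels given to id are made up for illustration, so treat the results as a demonstration of the calls rather than a usable classifier.

```r
library("textcat")

x <- c("This is an English sentence.",
       "Das ist ein deutscher Satz.")

# Languages covered by the default character profile database
names(TC_char_profiles)

# Same texts, classified against the smaller ECI/MCI profile database
textcat(x, p = ECIMCI_profiles)

# Build a (toy) profile database from your own labelled texts and use it
my_profiles <- textcat_profile_db(x, id = c("english", "german"))
textcat("Noch ein deutscher Satz.", p = my_profiles)
```

Note that the language labels returned depend on the profile database used, so results from different databases are not directly comparable.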
