Removing non-English words from text using Python
You can use the words corpus from NLTK:

    import nltk

    # The corpus must be available locally, e.g. after nltk.download("words").
    words = set(nltk.corpus.words.words())

    sent = "Io andiamo to the beach with my amico."
    " ".join(w for w in nltk.wordpunct_tokenize(sent)
             if w.lower() in words or not w.isalpha())
    # 'Io to the beach with my .'

The trailing period survives because non-alphabetic tokens always pass the filter. Unfortunately, Io happens to be an English word. In general, it …
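The filter described above can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the function name `keep_known_words` and the toy vocabulary are hypothetical, and a regular expression stands in for `nltk.wordpunct_tokenize` so the example can be tried without downloading the corpus. In practice you would probably pass `set(nltk.corpus.words.words())` as the vocabulary.

```python
import re


def keep_known_words(text, vocabulary):
    """Drop alphabetic tokens that are not in `vocabulary`; keep all other tokens."""
    # Split into runs of word characters and runs of punctuation
    # (an approximation of nltk.wordpunct_tokenize, assumed here for illustration).
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # Keep a token if its lowercase form is known, or if it is not purely alphabetic.
    kept = [t for t in tokens if t.lower() in vocabulary or not t.isalpha()]
    return " ".join(kept)


# Hypothetical toy vocabulary, for illustration only.
toy_vocabulary = {"io", "to", "the", "beach", "with", "my"}

result = keep_known_words("Io andiamo to the beach with my amico.", toy_vocabulary)
print(result)
```

Passing the vocabulary as a parameter keeps the filter independent of any one word list, so a larger or domain-specific list can be swapped in without changing the function.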