Removing non-English words from text using Python

You can use the words corpus from NLTK:

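import nltk

# One-time download of the word list, if it is not installed yet:
# nltk.download('words')
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
# Keep tokens found in the word list, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'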

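Unfortunately, Io happens to be in the English word list, so it is kept. In general, it may be hard to decide whether a word is English or not: any word list is incomplete, and some spellings are shared between languages.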
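
One way to handle false positives such as Io is to wrap the filter in a function that also takes a set of words to reject. The following is only a sketch: the function name keep_known_words, the exclude parameter, and the tiny demo_vocab are made up for illustration and are not part of NLTK. The regular expression is the pattern NLTK's WordPunctTokenizer is built on, used here so the example runs without downloading the corpus.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: runs of word characters,
# or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def keep_known_words(text, vocabulary, exclude=frozenset()):
    """Keep tokens found in `vocabulary` (but not in `exclude`),
    plus non-alphabetic tokens such as punctuation and numbers.

    Hypothetical helper for illustration, not an NLTK function.
    """
    kept = []
    for token in TOKEN_RE.findall(text):
        lowered = token.lower()
        if not token.isalpha():
            kept.append(token)  # punctuation, digits, mixed tokens
        elif lowered in vocabulary and lowered not in exclude:
            kept.append(token)  # known word that is not explicitly rejected
    return " ".join(kept)


# Tiny stand-in vocabulary (an assumption, for illustration only).
# In practice one could pass set(nltk.corpus.words.words()) instead.
demo_vocab = {"io", "to", "the", "beach", "with", "my"}

sent = "Io andiamo to the beach with my amico."
result = keep_known_words(sent, demo_vocab, exclude={"io"})
print(result)  # to the beach with my .
```

The exclude set has to be maintained by hand, so this only helps with false positives you already know about.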
