How to remove English words from a file containing Dari words?

You could install and use the NLTK library. It provides a list of English words and a tokenizer that splits each line into words:

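from nltk.tokenize import word_tokenize
from nltk.corpus import words

# Build a lowercase set once: membership tests on a set are fast,
# while scanning the raw list for every token is very slow.
english = set(word.lower() for word in words.words())

# Assumes both files are UTF-8 encoded.
with open('Dari.pos', encoding='utf-8') as f_input, open('DariNER.txt', 'w', encoding='utf-8') as f_output:
    for line in f_input:
        # Keep only the tokens that are not in the English word list.
        f_output.write(' '.join(word for word in word_tokenize(line) if word.lower() not in english) + '\n')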

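After installing nltk, download the data the script needs (this only has to be done once):

import nltk
nltk.download('words')  # the English word list
nltk.download('punkt')  # tokenizer models used by word_tokenize; recent NLTK releases ask for 'punkt_tab' instead

Calling nltk.download() with no arguments opens an interactive downloader where you can select words yourself.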
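
If you want to try the filtering step on its own before downloading anything, here is a minimal sketch. Note the assumptions: strip_english is a hypothetical helper name, sample_english is a tiny made-up word set standing in for the NLTK list, and it splits on whitespace instead of using word_tokenize, so a word with punctuation attached (such as "book,") would not be matched.

```python
def strip_english(line, english_words):
    """Return `line` with every whitespace-separated token found in `english_words` removed."""
    kept = [tok for tok in line.split() if tok.lower() not in english_words]
    return ' '.join(kept)


# Hypothetical sample vocabulary; with NLTK this would be built from words.words().
sample_english = {"book", "good"}

print(strip_english("کتاب book خوب Good است", sample_english))
```

Swapping sample_english for the set built from words.words() gives the same behaviour as the script above, apart from the simpler tokenization.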
