wordnet lemmatization and pos tagging in python

First of all, you can use nltk.pos_tag() directly without training it.
The function will load a pretrained tagger from a file. You can see the file name
with nltk.tag._POS_TAGGER:

nltk.tag._POS_TAGGER
>>> 'taggers/maxent_treebank_pos_tagger/english.pickle' 

As it was trained with the Treebank corpus, it also uses the Treebank tag set.

The following function would map the treebank tags to WordNet part of speech names:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)
>>> 'go'

Check the return value before passing it to the Lemmatizer because an empty string would give a KeyError.

Leave a Comment