nltk - w3toppers.com

What is NLTK POS tagger asking me to download?

For NLTK versions higher than v3.2, please use:

>>> import nltk
>>> nltk.__version__
'3.2.1'
>>> nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/alvas/nltk_data…
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!
True

For NLTK versions using the old MaxEnt model, i.e. v3.1 and below, please use:

>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data] …

NLTK WordNet Lemmatizer: Shouldn’t it lemmatize all inflections of a word?

The WordNet lemmatizer does take the POS tag into account, but it doesn’t magically determine it:

>>> import nltk
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'

Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you’re passing it the noun “loving” (as in “sweet loving”).
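If the POS tags come from nltk.pos_tag(), they are Penn Treebank tags such as 'VBG', not the single letters the lemmatizer expects. A minimal sketch of a mapping, assuming the usual WordNet letters 'n', 'v', 'a' and 'r'; the function name penn_to_wordnet is hypothetical and not part of NLTK:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.

    Hypothetical helper for illustration; not an NLTK function.
    """
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    # Everything else falls back to noun, which is also what the
    # lemmatizer assumes when no POS is given.
    return "n"


print(penn_to_wordnet("VBG"))
```

The returned letter can be passed as the second argument to lemmatize(), in the same position as the 'v' in the call above.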

Save Naive Bayes Trained Classifier in NLTK

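To save:

import pickle

# Write the trained classifier to disk in binary mode.
with open('my_classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

To load later:

import pickle

# Read the classifier back; only unpickle files you trust.
with open('my_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)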

Tag generation from text content

One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term ‘Markov’ is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. …
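A rough sketch of that idea, assuming the document and the collection are already tokenized into lowercase words. The scoring (relative frequency in the document divided by a smoothed relative frequency in the collection) is just one simple choice, and candidate_tags and its parameters are hypothetical names:

```python
from collections import Counter


def candidate_tags(document_tokens, collection_tokens, top_n=5, min_count=2):
    """Rank words that are unusually frequent in one document."""
    doc_counts = Counter(document_tokens)
    coll_counts = Counter(collection_tokens)
    doc_total = sum(doc_counts.values())
    coll_total = sum(coll_counts.values())
    vocab_size = len(set(doc_counts) | set(coll_counts))

    scores = {}
    for word, count in doc_counts.items():
        if count < min_count:
            continue  # ignore words seen only once in the document
        doc_rate = count / doc_total
        # Add-one smoothing so words missing from the collection
        # do not cause a division by zero.
        coll_rate = (coll_counts[word] + 1) / (coll_total + vocab_size)
        scores[word] = doc_rate / coll_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


collection = "the cat sat on the mat the dog sat on the log".split() * 10
document = "the markov chain is a markov process and the markov property holds".split()
print(candidate_tags(document, collection, top_n=1))  # prints ['markov']
```

Common words such as "the" score low because they are just as frequent in the wider collection, while "markov" stands out.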

What is the difference between lemmatization vs stemming?

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope …
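To make the contrast concrete, here is a toy sketch. Neither function is a real algorithm: crude_stem is not the Porter stemmer and toy_lemmatize is not the WordNet lemmatizer. The suffix list and the tiny lemma table are made up for illustration.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical hand-written lemma table standing in for a real vocabulary.
LEMMAS = {"loving": "love", "saw": "see", "better": "good"}


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a vocabulary; return it unchanged if unknown."""
    return LEMMAS.get(word, word)


print(crude_stem("loving"), toy_lemmatize("loving"))  # prints: lov love
```

The stemmer returns "lov", which is not a word, while the lookup returns the dictionary form "love". The lookup also handles irregular forms like "saw", which no suffix rule would reach.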

NLTK v3.2: Unable to nltk.pos_tag()

EDITED: This issue has been resolved as of NLTK v3.2.1. Upgrading your NLTK version should resolve it, e.g. pip install -U nltk.

I faced the same issue, and the error encountered was as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File …

Using my own corpus instead of movie_reviews corpus for Classification in NLTK

If you have your data in exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to “hack” your way through:

1. Put your corpus directory into the directory where your NLTK data is saved

First check where your NLTK data is saved:

>>> import nltk
>>> nltk.data.find('corpora/movie_reviews')
FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')

Then move your directory to where …

Creating a custom categorized corpus in NLTK and Python

Here is the answer to my question. Since I was thinking about using two cases, I think it’s good to cover both in case someone needs the answer in the future. If you have the same setup as the movie_reviews corpus – several folders labeled in the same way you would like your labels to …
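For the folder-per-label layout, a minimal sketch using only the standard library rather than an NLTK corpus reader. It assumes a layout like root/pos/*.txt and root/neg/*.txt with UTF-8 text files; the function name load_labeled_documents is hypothetical:

```python
from pathlib import Path


def load_labeled_documents(root):
    """Return (text, label) pairs, using each subfolder name as the label.

    Assumes a hypothetical layout: root/<label>/<anything>.txt
    """
    documents = []
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue  # skip stray files such as a README at the top level
        for path in sorted(label_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            documents.append((text, label_dir.name))
    return documents
```

The resulting list of (text, label) pairs can then be tokenized and turned into feature sets for a classifier.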

Saving nltk drawn parse tree to image file

Using the nltk.draw.tree.TreeView object to create the canvas frame automatically:

>>> from nltk.tree import Tree
>>> from nltk.draw.tree import TreeView
>>> t = Tree.fromstring('(S (NP this tree) (VP (V is) (AdjP pretty)))')
>>> TreeView(t)._cframe.print_to_file('output.ps')

Then convert the PostScript file with ImageMagick's convert command:

>>> import os
>>> os.system('convert output.ps output.png')

Tokenize a paragraph into sentence and then into words in NLTK

You probably intended to loop over sent_text:

import nltk

sent_text = nltk.sent_tokenize(text)  # this gives us a list of sentences

# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)