Extract Word from Synset using Wordnet in NLTK 3.0

WordNet works fine in NLTK 3.0. You are just accessing the lemmas (and their names) in the wrong way. Try this instead:

```python
>>> import nltk
>>> nltk.__version__
'3.0.0'
>>> from nltk.corpus import wordnet as wn
>>> for synset in wn.synsets('dog'):
...     for lemma in synset.lemmas():
...         print lemma.name()
dog
domestic_dog
Canis_familiaris
frump
dog
dog
cad
bounder
blackguard
...
```

… Read more

How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?

In short:

```python
df['Text'].apply(word_tokenize)
```

Or, if you want to add another column to store the tokenized list of strings:

```python
df['tokenized_text'] = df['Text'].apply(word_tokenize)
```

There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

To use nltk.tokenize.TweetTokenizer:

```python
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
```

Similar to: How to apply pos_tag_sents() to pandas dataframe efficiently, how to use … Read more
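The apply pattern itself can be demoed without NLTK installed. In this sketch the standard `str.split` stands in for `word_tokenize`; the dataframe and its contents are illustrative, but the column names match the answer above:

```python
import pandas as pd

# Tiny illustrative dataframe; str.split stands in for nltk's word_tokenize
df = pd.DataFrame({'Text': ['hello world', 'nltk on pandas']})

# .apply calls the tokenizer once per cell and returns a column of token lists
df['tokenized_text'] = df['Text'].apply(str.split)
print(df['tokenized_text'].tolist())  # → [['hello', 'world'], ['nltk', 'on', 'pandas']]
```

Swapping `str.split` for `word_tokenize` or `tt.tokenize` changes only the function passed to `.apply`.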

How to get synonyms from nltk WordNet Python

If you want the synonyms in the synset (i.e. the lemmas that make up the set), you can get them with lemma_names():

```python
>>> for ss in wn.synsets('small'):
...     print(ss.name(), ss.lemma_names())
small.n.01 ['small']
small.n.02 ['small']
small.a.01 ['small', 'little']
minor.s.10 ['minor', 'modest', 'small', 'small-scale', 'pocket-size', 'pocket-sized']
little.s.03 ['little', 'small']
small.s.04 ['small']
humble.s.01 ['humble', 'low', 'lowly', 'modest', 'small']
```

… Read more
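A common follow-up is collapsing those per-synset lists into one flat set of synonyms. A stdlib-only sketch, using hard-coded (synset name, lemma names) pairs copied from the output above in place of live `wn.synsets('small')` calls:

```python
# Stand-in for iterating wn.synsets('small'); pairs copied from the output above
synsets = [
    ('small.a.01', ['small', 'little']),
    ('minor.s.10', ['minor', 'modest', 'small', 'small-scale', 'pocket-size', 'pocket-sized']),
    ('humble.s.01', ['humble', 'low', 'lowly', 'modest', 'small']),
]

# Union all lemma names into a single deduplicated synonym set
synonyms = {name for _, lemmas in synsets for name in lemmas}
print(sorted(synonyms))
```

With live WordNet, the inner lists would come from `ss.lemma_names()` instead.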

NLTK and language detection

Have you come across the following code snippet?

```python
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
```

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active

Or the following demo file? https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py
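The same set-difference trick works with any reference vocabulary. A stdlib-only sketch, where a tiny hard-coded `english_vocab` stands in for `nltk.corpus.words.words()` and the sample tokens are illustrative:

```python
# Toy stand-in vocabulary; in practice, build it from nltk.corpus.words.words()
english_vocab = {'the', 'cat', 'sat', 'on', 'mat'}

text = ['The', 'cat', 'sat', 'on', 'the', 'tapis', '!']
# Lowercase, keep alphabetic tokens only, then diff against the vocabulary
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
print(sorted(unusual))  # → ['tapis']
```

A high ratio of out-of-vocabulary tokens suggests the text is not English (or is heavily misspelled).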

English grammar for parsing in NLTK

You can take a look at pyStatParser, a simple Python statistical parser that returns NLTK parse trees. It comes with public treebanks, and it generates the grammar model only the first time you instantiate a Parser object (in about 8 seconds). It uses a CKY algorithm and parses average-length sentences (like the one … Read more

str.translate gives TypeError – Translate takes one argument (2 given), worked in Python 2

If all you are looking to accomplish is to do the same thing in Python 3 that you were doing in Python 2, here is what I was doing in Python 2 to throw away punctuation and digits:

```python
text = text.translate(None, string.punctuation)
text = text.translate(None, '1234567890')
```

Here is my Python 3 equivalent:

```python
text = text.translate(str.maketrans('', '', string.punctuation))
```

text … Read more
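Put together as a runnable Python 3 snippet (the sample text is illustrative): `str.maketrans` with two empty strings builds a table whose third argument lists characters to delete, replacing the two-argument Python 2 form of `translate`:

```python
import string

text = 'Hello, World! 123'
# Third argument to str.maketrans: characters to delete
text = text.translate(str.maketrans('', '', string.punctuation))
text = text.translate(str.maketrans('', '', '1234567890'))
print(repr(text))  # → 'Hello World '
```

Both deletions could also be combined into a single `maketrans` call by concatenating the two character sets.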

How to get rid of punctuation using NLTK tokenizer?

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

```python
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
```

Output:

```python
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```
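If NLTK is not installed, the same `\w+` behavior can be sketched with the standard-library re module (the helper name `tokenize` is mine, not NLTK's):

```python
import re

def tokenize(text):
    # \w+ grabs runs of word characters, implicitly dropping punctuation
    return re.findall(r'\w+', text)

print(tokenize('Eighty-seven miles to go, yet. Onward!'))
# → ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```

Note that, like RegexpTokenizer(r'\w+'), this also splits hyphenated words and contractions, which may or may not be what you want.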

nltk doesn’t add $NLTK_DATA to search path?

If you don't want to set $NLTK_DATA before running your scripts, you can do it within the Python script with:

```python
import nltk
nltk.data.path.append('/home/alvas/some_path/nltk_data/')
```

E.g., let's move the nltk_data to a non-standard path where NLTK won't find it automatically:

```shell
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help misc models stemmers taggers tokenizers
alvas@ubi:~$ mkdir some_path
```

… Read more

How do I tokenize a string sentence in NLTK?

This is actually on the main page of nltk.org:

```python
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
```