Using NLTK and WordNet, how do I convert a simple tense verb into its present, past or past participle form?

NLTK can help here. It won't give you the exact tense, but it can give you the base form of the verb, which is still useful. Try the following code:

    from nltk.stem.wordnet import WordNetLemmatizer

    words = ['gave', 'went', 'going', 'dating']
    for word in words:
        print(word + "-->" + WordNetLemmatizer().lemmatize(word, 'v'))

The output is:

    gave-->give
    went-->go
    going-->go
    dating-->date

Have … Read more
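Going the other way (base form to past or past participle) is not something the WordNet lemmatizer does; inflection needs its own data. As a sketch only, a small hand-built lookup table (all entries below are made up for illustration, not an NLTK API) shows the shape such an inflector would take:

```python
# Hypothetical sketch: WordNetLemmatizer only maps inflected forms back to
# the base form. Producing a tense requires a data source of its own; a
# tiny lookup table is enough to show the idea.
FORMS = {
    # verb: (past, past participle)
    "give": ("gave", "given"),
    "go":   ("went", "gone"),
    "date": ("dated", "dated"),
}

def inflect(verb, tense):
    """Return the requested form, falling back to the base verb."""
    past, participle = FORMS.get(verb, (verb, verb))
    return {"present": verb, "past": past, "participle": participle}[tense]

print(inflect("give", "past"))      # gave
print(inflect("go", "participle"))  # gone
```

For real coverage you would need a full inflection lexicon rather than a hand-made table.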

NLTK Tagging Spanish words using a corpus

First you need to read tagged sentences from a corpus. NLTK provides a nice interface so you don't have to bother with the different formats used by the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml . Then you have to choose a tagger and train … Read more
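The train-then-tag workflow can be sketched without downloading a corpus: learn the most frequent tag per word from tagged sentences, then tag new text — the same idea behind NLTK's UnigramTagger. The two-sentence "corpus" below is invented for illustration, not real corpus data:

```python
# Minimal unigram-tagger sketch: count tags per word over tagged
# sentences, then tag each new word with its most frequent tag.
from collections import Counter, defaultdict

tagged_sents = [
    [("el", "DA"), ("perro", "NC"), ("corre", "VM")],
    [("la", "DA"), ("casa", "NC"), ("es", "VS"), ("grande", "AQ")],
]

counts = defaultdict(Counter)
for sent in tagged_sents:
    for word, tag in sent:
        counts[word][tag] += 1

def tag(words, default="NC"):
    """Tag each word with its most frequent training tag, or a default."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in words]

print(tag(["el", "gato", "corre"]))
# [('el', 'DA'), ('gato', 'NC'), ('corre', 'VM')]
```

With NLTK proper you would train on a real tagged corpus such as cess_esp and get backoff taggers for unknown words instead of a single default tag.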

Computing N Grams using Python

A short Pythonesque solution from this blog:

    def find_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

Usage (on Python 3, wrap the result in list(), since zip returns an iterator):

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', … Read more
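Because the function yields hashable tuples, it composes naturally with collections.Counter when you want n-gram frequencies rather than the raw tuples:

```python
from collections import Counter

def find_ngrams(input_list, n):
    # Zip the list against n shifted copies of itself to form n-grams.
    return zip(*[input_list[i:] for i in range(n)])

tokens = ["all", "this", "happened", "more", "or", "less", "more", "or"]
bigram_counts = Counter(find_ngrams(tokens, 2))
print(bigram_counts[("more", "or")])  # 2
```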

NLTK Named Entity recognition to a Python list

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree to get to the NEs. Take a look at Named Entity Recognition with Regular Expression: NLTK

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> from nltk.tree import Tree
    >>>
    >>> def get_continuous_chunks(text):
    ...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
    ...     continuous_chunk = [] … Read more
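The traversal idea itself is independent of NLTK: walk the nested structure and, wherever you hit a labelled chunk, join its leaves into one string. As a sketch, the hand-built nested lists below stand in for ne_chunk's Tree output and are invented for illustration:

```python
# Stand-in for ne_chunk output: a labelled NE chunk is represented as a
# (label, [(word, tag), ...]) pair, and a plain token as a (word, tag) pair.
chunked = [
    ("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("visited", "VBD"),
    ("GPE", [("Berlin", "NNP")]),
]

def get_chunks(chunked):
    """Collect the joined word span of every labelled chunk."""
    spans = []
    for label, payload in chunked:
        if isinstance(payload, list):  # a labelled NE chunk, not a leaf
            spans.append(" ".join(word for word, _tag in payload))
    return spans

print(get_chunks(chunked))  # ['Barack Obama', 'Berlin']
```

With a real Tree you would test isinstance(node, Tree) instead of isinstance(payload, list), but the walk is the same.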

Opening A large JSON file

You want an incremental JSON parser like yajl and one of its Python bindings. An incremental parser reads as little as possible from the input and invokes a callback when something meaningful is decoded. For example, to pull only numbers from a big JSON file:

    class ContentHandler(YajlContentHandler):
        def yajl_number(self, ctx, val):
            list_of_numbers.append(float(val))

    parser = YajlParser(ContentHandler()) … Read more
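If you'd rather stay in the standard library, json.JSONDecoder.raw_decode supports a similar incremental idea: it decodes one value from a given position in a buffer and returns how far it read, so you can stream a string of concatenated JSON documents one value at a time:

```python
import json

def iter_json_values(buffer):
    """Yield successive JSON values from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(buffer):
        # Skip whitespace between documents.
        while idx < len(buffer) and buffer[idx].isspace():
            idx += 1
        if idx >= len(buffer):
            break
        # raw_decode returns (value, end_index) for the value starting at idx.
        value, idx = decoder.raw_decode(buffer, idx)
        yield value

stream = '{"a": 1}\n[2, 3]\n42'
print(list(iter_json_values(stream)))  # [{'a': 1}, [2, 3], 42]
```

This still requires the whole string in memory, so for truly huge files a callback-based parser like yajl (or ijson) remains the better fit.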

Create a custom Transformer in PySpark ML

Can I extend the default one? Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like other transformers and estimators from pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly.

    import nltk

    from pyspark import keyword_only  ## < 2.0 -> pyspark.ml.util.keyword_only
    from pyspark.ml … Read more
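The overall pattern — subclass Transformer, store input/output column names, implement the transformation — can be sketched without a Spark session using plain Python stand-ins. The base class and list-of-dicts "DataFrame" below are simplified mocks invented for illustration, not pyspark APIs:

```python
# Simplified mock of the pyspark.ml pattern: a Transformer exposes
# transform(), and subclasses implement the actual work in _transform().
class Transformer:
    def transform(self, dataset):
        return self._transform(dataset)

class SimpleTokenizer(Transformer):
    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # str.split stands in for nltk.word_tokenize here.
        return [dict(row, **{self.outputCol: row[self.inputCol].split()})
                for row in dataset]

rows = [{"text": "hello spark world"}]
print(SimpleTokenizer("text", "words").transform(rows))
# [{'text': 'hello spark world', 'words': ['hello', 'spark', 'world']}]
```

In real pyspark the column names are Params managed via keyword_only, and _transform receives and returns a Spark DataFrame, typically applying the Python function through a UDF.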

All synonyms for word in python? [duplicate]

Using wn.synset('dog.n.01').lemma_names() (an attribute rather than a method in older NLTK versions) is the correct way to access the synonyms of a sense. That's because a word has many senses, and it's more appropriate to list the synonyms of a particular meaning/sense. To enumerate words with similar meanings, you can also look at the hyponyms. Sadly, the size of WordNet is very limited, so there … Read more
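The reason to index synonyms by sense rather than by word can be shown with a tiny hand-built table mimicking WordNet's synset-to-lemma layout (the entries below are illustrative, not real WordNet data):

```python
# Synonyms keyed by sense, in the spirit of synset -> lemma_names().
# The data is hand-made for illustration.
SYNSETS = {
    "dog.n.01": ["dog", "domestic_dog", "canis_familiaris"],
    "dog.v.01": ["chase", "tail", "dog"],
}

def synonyms(word):
    """All lemma names across every sense that contains the word."""
    return sorted({lemma
                   for lemmas in SYNSETS.values()
                   if word in lemmas
                   for lemma in lemmas})

print(synonyms("dog"))
# ['canis_familiaris', 'chase', 'dog', 'domestic_dog', 'tail']
```

Merging across senses like this is exactly why "all synonyms of a word" mixes meanings — here the noun ("canis_familiaris") and the verb ("chase") land in one list.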