Using NLTK and WordNet, how do I convert a simple tense verb into its present, past or past participle form?

NLTK can help here. It won't give you the exact tense, but it can give you the base form of the verb, which is still useful. Try the following code:

    from nltk.stem.wordnet import WordNetLemmatizer

    words = ['gave', 'went', 'going', 'dating']
    for word in words:
        print(word + "-->" + WordNetLemmatizer().lemmatize(word, 'v'))

The output is:

    gave-->give
    went-->go
    going-->go
    dating-->date

Have … Read more
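Going the other way (base form to past or past participle) is not something the WordNet lemmatizer does; inflection needs its own data. As a sketch only, a small hand-built lookup table (all entries below are made up for illustration, not an NLTK API) shows the shape such an inflector would take:

```python
# Hypothetical sketch: WordNetLemmatizer only maps inflected forms back to
# the base form. Producing a tense requires a data source of its own; a
# tiny lookup table is enough to show the idea.
FORMS = {
    # verb: (past, past participle)
    "give": ("gave", "given"),
    "go":   ("went", "gone"),
    "date": ("dated", "dated"),
}

def inflect(verb, tense):
    """Return the requested form, falling back to the base verb."""
    past, participle = FORMS.get(verb, (verb, verb))
    return {"present": verb, "past": past, "participle": participle}[tense]

print(inflect("give", "past"))      # gave
print(inflect("go", "participle"))  # gone
```

For real coverage you would need a full inflection lexicon rather than a hand-made table.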

NLTK Tagging Spanish words using a corpus

First you need to read tagged sentences from a corpus. NLTK provides a nice interface so you don't have to bother with the different formats used by the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml . Then you have to choose a tagger and train … Read more
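The train-then-tag workflow can be sketched without downloading a corpus: learn the most frequent tag per word from tagged sentences, then tag new text — the same idea behind NLTK's UnigramTagger. The two-sentence "corpus" below is invented for illustration, not real corpus data:

```python
# Minimal unigram-tagger sketch: count tags per word over tagged
# sentences, then tag each new word with its most frequent tag.
from collections import Counter, defaultdict

tagged_sents = [
    [("el", "DA"), ("perro", "NC"), ("corre", "VM")],
    [("la", "DA"), ("casa", "NC"), ("es", "VS"), ("grande", "AQ")],
]

counts = defaultdict(Counter)
for sent in tagged_sents:
    for word, tag in sent:
        counts[word][tag] += 1

def tag(words, default="NC"):
    """Tag each word with its most frequent training tag, or a default."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in words]

print(tag(["el", "gato", "corre"]))
# [('el', 'DA'), ('gato', 'NC'), ('corre', 'VM')]
```

With NLTK proper you would train on a real tagged corpus such as cess_esp and get backoff taggers for unknown words instead of a single default tag.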

Computing N Grams using Python

A short Pythonesque solution from this blog:

    def find_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

Usage (on Python 3, wrap the result in list(), since zip returns an iterator):

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', … Read more
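Because the function yields hashable tuples, it composes naturally with collections.Counter when you want n-gram frequencies rather than the raw tuples:

```python
from collections import Counter

def find_ngrams(input_list, n):
    # Zip the list against n shifted copies of itself to form n-grams.
    return zip(*[input_list[i:] for i in range(n)])

tokens = ["all", "this", "happened", "more", "or", "less", "more", "or"]
bigram_counts = Counter(find_ngrams(tokens, 2))
print(bigram_counts[("more", "or")])  # 2
```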

NLTK Named Entity recognition to a Python list

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree to get to the NEs. Take a look at Named Entity Recognition with Regular Expression: NLTK

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> from nltk.tree import Tree
    >>>
    >>> def get_continuous_chunks(text):
    ...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
    ...     continuous_chunk = [] … Read more
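The traversal idea itself is independent of NLTK: walk the nested structure and, wherever you hit a labelled chunk, join its leaves into one string. As a sketch, the hand-built nested lists below stand in for ne_chunk's Tree output and are invented for illustration:

```python
# Stand-in for ne_chunk output: a labelled NE chunk is represented as a
# (label, [(word, tag), ...]) pair, and a plain token as a (word, tag) pair.
chunked = [
    ("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("visited", "VBD"),
    ("GPE", [("Berlin", "NNP")]),
]

def get_chunks(chunked):
    """Collect the joined word span of every labelled chunk."""
    spans = []
    for label, payload in chunked:
        if isinstance(payload, list):  # a labelled NE chunk, not a leaf
            spans.append(" ".join(word for word, _tag in payload))
    return spans

print(get_chunks(chunked))  # ['Barack Obama', 'Berlin']
```

With a real Tree you would test isinstance(node, Tree) instead of isinstance(payload, list), but the walk is the same.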

Opening A large JSON file

You want an incremental JSON parser like yajl and one of its Python bindings. An incremental parser reads as little as possible from the input and invokes a callback when something meaningful is decoded. For example, to pull only numbers from a big JSON file:

    class ContentHandler(YajlContentHandler):
        def yajl_number(self, ctx, val):
            list_of_numbers.append(float(val))

    parser = YajlParser(ContentHandler()) … Read more
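If you'd rather stay in the standard library, json.JSONDecoder.raw_decode supports a similar incremental idea: it decodes one value from a given position in a buffer and returns how far it read, so you can stream a string of concatenated JSON documents one value at a time:

```python
import json

def iter_json_values(buffer):
    """Yield successive JSON values from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(buffer):
        # Skip whitespace between documents.
        while idx < len(buffer) and buffer[idx].isspace():
            idx += 1
        if idx >= len(buffer):
            break
        # raw_decode returns (value, end_index) for the value starting at idx.
        value, idx = decoder.raw_decode(buffer, idx)
        yield value

stream = '{"a": 1}\n[2, 3]\n42'
print(list(iter_json_values(stream)))  # [{'a': 1}, [2, 3], 42]
```

This still requires the whole string in memory, so for truly huge files a callback-based parser like yajl (or ijson) remains the better fit.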

Create a custom Transformer in PySpark ML

Can I extend the default one? Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like other transformers and estimators from pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly.

    import nltk

    from pyspark import keyword_only  ## < 2.0 -> pyspark.ml.util.keyword_only
    from pyspark.ml … Read more
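The overall pattern — subclass Transformer, store input/output column names, implement the transformation — can be sketched without a Spark session using plain Python stand-ins. The base class and list-of-dicts "DataFrame" below are simplified mocks invented for illustration, not pyspark APIs:

```python
# Simplified mock of the pyspark.ml pattern: a Transformer exposes
# transform(), and subclasses implement the actual work in _transform().
class Transformer:
    def transform(self, dataset):
        return self._transform(dataset)

class SimpleTokenizer(Transformer):
    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # str.split stands in for nltk.word_tokenize here.
        return [dict(row, **{self.outputCol: row[self.inputCol].split()})
                for row in dataset]

rows = [{"text": "hello spark world"}]
print(SimpleTokenizer("text", "words").transform(rows))
# [{'text': 'hello spark world', 'words': ['hello', 'spark', 'world']}]
```

In real pyspark the column names are Params managed via keyword_only, and _transform receives and returns a Spark DataFrame, typically applying the Python function through a UDF.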

All synonyms for word in python? [duplicate]

Using wn.synset('dog.n.01').lemma_names() (an attribute rather than a method in older NLTK versions) is the correct way to access the synonyms of a sense. That's because a word has many senses, and it's more appropriate to list the synonyms of a particular meaning/sense. To enumerate words with similar meanings, you can also look at the hyponyms. Sadly, the size of WordNet is very limited, so there … Read more
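The reason to index synonyms by sense rather than by word can be shown with a tiny hand-built table mimicking WordNet's synset-to-lemma layout (the entries below are illustrative, not real WordNet data):

```python
# Synonyms keyed by sense, in the spirit of synset -> lemma_names().
# The data is hand-made for illustration.
SYNSETS = {
    "dog.n.01": ["dog", "domestic_dog", "canis_familiaris"],
    "dog.v.01": ["chase", "tail", "dog"],
}

def synonyms(word):
    """All lemma names across every sense that contains the word."""
    return sorted({lemma
                   for lemmas in SYNSETS.values()
                   if word in lemmas
                   for lemma in lemmas})

print(synonyms("dog"))
# ['canis_familiaris', 'chase', 'dog', 'domestic_dog', 'tail']
```

Merging across senses like this is exactly why "all synonyms of a word" mixes meanings — here the noun ("canis_familiaris") and the verb ("chase") land in one list.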