Python NLTK pos_tag not returning the correct part-of-speech tag

In short: NLTK is not perfect; in fact, no model is perfect. Note: as of NLTK version 3.1, the default pos_tag function is no longer the old MaxEnt English pickle. It is now the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> print inspect.getsource(pos_tag)
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, … Read more
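As a minimal usage sketch (assuming the punkt and averaged_perceptron_tagger resources have already been fetched with nltk.download), calling the tagger might look like:

from nltk import pos_tag, word_tokenize

# Since NLTK 3.1, pos_tag dispatches to the averaged perceptron tagger
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ...]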

Creating a new corpus with NLTK

After some years of figuring out how it works, here is an updated tutorial on how to create an NLTK corpus from a directory of text files. The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader. If … Read more
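A rough sketch of that idea (the directory path and file pattern below are placeholders):

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Point the reader at a directory of .txt files (hypothetical path)
corpus = PlaintextCorpusReader('/path/to/textfiles', r'.*\.txt')

print(corpus.fileids())      # files the reader picked up
print(corpus.words()[:20])   # word-tokenized content
print(corpus.sents()[:2])    # sentence-segmented content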

How to compute the similarity between two text documents?

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this; see especially Introduction to Information Retrieval, which is free and available online.

Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python … Read more
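A minimal sketch of that pipeline, assuming scikit-learn is the Python library in question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The cat sat on the mat.", "A cat lay on a rug."]

# TfidfVectorizer L2-normalizes rows by default, so dot products
# between rows are already cosine similarities
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))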

Calculate cosine similarity given 2 sentence strings

A simple pure-Python implementation would be:

import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) … Read more
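The excerpt cuts off at the denominator, which is the product of the two vector norms (with a guard against zero). Assuming that completion, a companion helper to build the vectors and a usage sketch might be:

def text_to_vector(text):
    # Bag-of-words Counter built with the same WORD regex as above
    return Counter(WORD.findall(text))

vector1 = text_to_vector("This is a foo bar sentence.")
vector2 = text_to_vector("This sentence is similar to a foo bar sentence.")
print(get_cosine(vector1, vector2))  # roughly 0.86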

How to use Stanford Parser in NLTK using Python

Note that this answer applies to NLTK v3.0 and not to more recent versions. Sure, try the following in Python:

import os
from nltk.parse import stanford

os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars'
os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars'

parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print sentences

# GUI
for line in sentences: … Read more
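More recent NLTK releases have moved away from this StanfordParser interface in favour of nltk.parse.corenlp. A minimal sketch, assuming a Stanford CoreNLP server is already running locally on port 9000:

from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
# raw_parse returns an iterator of parse trees
parse = next(parser.raw_parse('What is your name?'))
parse.pretty_print()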

Feature extraction in Python for NLP

import re

test_string = "The New York Police, New Delhi police and other police departments are fighting crime".lower()
cities = ['new delhi', 'new york']
regex = "((" + "|".join(cities) + ") police)"
regex = regex.lower()
results = re.findall(regex, test_string)
print([res[0] for res in results])
# ['new york police', 'new delhi police']
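One hedge worth adding if the city list ever grows: escape the names, since a city such as "St. Louis" would otherwise inject regex metacharacters. A sketch of the same snippet with re.escape:

import re

test_string = "The New York Police, New Delhi police and other police departments are fighting crime".lower()
cities = ['new delhi', 'new york']

# re.escape keeps literal dots, parentheses, etc. in city names from
# being interpreted as regex syntax
pattern = "((" + "|".join(re.escape(c) for c in cities) + ") police)"
print([res[0] for res in re.findall(pattern, test_string)])
# ['new york police', 'new delhi police']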

How to remove English words from a file containing Dari words?

You could install and use the nltk library. This provides you with a list of English words and a means to split each line into words:

from nltk.tokenize import word_tokenize
from nltk.corpus import words

english = words.words()

with open('Dari.pos') as f_input, open('DariNER.txt', 'w') as f_output:
    for line in f_input:
        f_output.write(' '.join(word for word in word_tokenize(line) … Read more
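The excerpt cuts off inside the generator expression; a plausible completion (the exact filter condition, and using a set for fast membership tests, are assumptions on my part; the words and punkt corpora must be downloaded first) would be:

from nltk.tokenize import word_tokenize
from nltk.corpus import words

# A set makes the per-token membership test O(1) instead of a list scan
english = set(words.words())

with open('Dari.pos') as f_input, open('DariNER.txt', 'w') as f_output:
    for line in f_input:
        # Keep only tokens that are not in the English word list
        kept = (word for word in word_tokenize(line) if word.lower() not in english)
        f_output.write(' '.join(kept) + '\n')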