Python NLTK pos_tag not returning the correct part-of-speech tag

In short: NLTK is not perfect; in fact, no model is perfect. Note: as of NLTK version 3.1, the default pos_tag function is no longer the old MaxEnt English pickle. It is now the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> print inspect.getsource(pos_tag)
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, … Read more
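As a minimal usage sketch (assuming the punkt and averaged_perceptron_tagger resources have already been fetched with nltk.download), calling the tagger might look like:

from nltk import pos_tag, word_tokenize

# Since NLTK 3.1, pos_tag dispatches to the averaged perceptron tagger
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ...]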

Creating a new corpus with NLTK

After some years of figuring out how it works, here is an updated tutorial on how to create an NLTK corpus from a directory of text files. The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader. If … Read more
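A rough sketch of that idea (the directory path and file pattern below are placeholders):

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Point the reader at a directory of .txt files (hypothetical path)
corpus = PlaintextCorpusReader('/path/to/textfiles', r'.*\.txt')

print(corpus.fileids())      # files the reader picked up
print(corpus.words()[:20])   # word-tokenized content
print(corpus.sents()[:2])    # sentence-segmented content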

How to compute the similarity between two text documents?

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this; see especially Introduction to Information Retrieval, which is free and available online.

Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python … Read more
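A minimal sketch of that pipeline, assuming scikit-learn is the Python library in question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The cat sat on the mat.", "A cat lay on a rug."]

# TfidfVectorizer L2-normalizes rows by default, so dot products
# between rows are already cosine similarities
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))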

Calculate cosine similarity given 2 sentence strings

A simple pure-Python implementation would be:

import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) … Read more
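The excerpt cuts off at the denominator, which is the product of the two vector norms (with a guard against zero). Assuming that completion, a companion helper to build the vectors and a usage sketch might be:

def text_to_vector(text):
    # Bag-of-words Counter built with the same WORD regex as above
    return Counter(WORD.findall(text))

vector1 = text_to_vector("This is a foo bar sentence.")
vector2 = text_to_vector("This sentence is similar to a foo bar sentence.")
print(get_cosine(vector1, vector2))  # roughly 0.86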

How to use Stanford Parser in NLTK using Python

Note that this answer applies to NLTK v3.0 and not to more recent versions. Sure, try the following in Python:

import os
from nltk.parse import stanford

os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars'
os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars'

parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print sentences

# GUI
for line in sentences: … Read more
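More recent NLTK releases have moved away from this StanfordParser interface in favour of nltk.parse.corenlp. A minimal sketch, assuming a Stanford CoreNLP server is already running locally on port 9000:

from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
# raw_parse returns an iterator of parse trees
parse = next(parser.raw_parse('What is your name?'))
parse.pretty_print()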

Feature extraction in Python for NLP

import re

test_string = "The New York Police, New Delhi police and other police departments are fighting crime".lower()
cities = ['new delhi', 'new york']
regex = "((" + "|".join(cities) + ") police)"
regex = regex.lower()
results = re.findall(regex, test_string)
print([res[0] for res in results])
# ['new york police', 'new delhi police']
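One hedge worth adding if the city list ever grows: escape the names, since a city such as "St. Louis" would otherwise inject regex metacharacters. A sketch of the same snippet with re.escape:

import re

test_string = "The New York Police, New Delhi police and other police departments are fighting crime".lower()
cities = ['new delhi', 'new york']

# re.escape keeps literal dots, parentheses, etc. in city names from
# being interpreted as regex syntax
pattern = "((" + "|".join(re.escape(c) for c in cities) + ") police)"
print([res[0] for res in re.findall(pattern, test_string)])
# ['new york police', 'new delhi police']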

How to remove English words from a file containing Dari words?

You could install and use the nltk library. This provides you with a list of English words and a means to split each line into words:

from nltk.tokenize import word_tokenize
from nltk.corpus import words

english = words.words()

with open('Dari.pos') as f_input, open('DariNER.txt', 'w') as f_output:
    for line in f_input:
        f_output.write(' '.join(word for word in word_tokenize(line) … Read more
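The excerpt cuts off inside the generator expression; a plausible completion (the exact filter condition, and using a set for fast membership tests, are assumptions on my part; the words and punkt corpora must be downloaded first) would be:

from nltk.tokenize import word_tokenize
from nltk.corpus import words

# A set makes the per-token membership test O(1) instead of a list scan
english = set(words.words())

with open('Dari.pos') as f_input, open('DariNER.txt', 'w') as f_output:
    for line in f_input:
        # Keep only tokens that are not in the English word list
        kept = (word for word in word_tokenize(line) if word.lower() not in english)
        f_output.write(' '.join(kept) + '\n')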