Ordinal numbers replacement

Here’s a terse solution taken from Gareth on codegolf: ordinal = lambda n: "%d%s" % (n, "tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4]). It works on any number: print([ordinal(n) for n in range(1, 32)]) ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', '13th', '14th', '15th', '16th', '17th', '18th', '19th', '20th', '21st', '22nd', '23rd', '24th', '25th', '26th', '27th', '28th', '29th', '30th', … Read more
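The trick is that "tsnrhtdd" interleaves the four suffixes, so slicing every 4th character from offset k recovers one of them; the boolean products compute k, falling back to 0 ("th") for 11–13 and for unit digits above 3. A sketch with the index logic unpacked:

```python
# The golfed one-liner, written out. "tsnrhtdd" interleaves the four
# suffixes: [0::4] -> "th", [1::4] -> "st", [2::4] -> "nd", [3::4] -> "rd".
def ordinal(n):
    tens = n // 10 % 10              # tens digit, to catch 11, 12, 13
    unit = n % 10
    if tens != 1 and unit in (1, 2, 3):
        suffix = "tsnrhtdd"[unit::4]   # "st", "nd" or "rd"
    else:
        suffix = "th"
    return "%d%s" % (n, suffix)

print([ordinal(n) for n in (1, 2, 3, 4, 11, 21, 112)])
# → ['1st', '2nd', '3rd', '4th', '11th', '21st', '112th']
```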

Stopword removal with NLTK

There is a built-in stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al.), see http://nltk.org/book/ch02.html >>> from nltk import word_tokenize >>> from nltk.corpus import stopwords >>> stop = set(stopwords.words('english')) >>> sentence = "this is a foo bar sentence" >>> print([i for i in sentence.lower().split() if i not in stop]) … Read more
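The filtering itself is a plain set-membership comprehension. A minimal, dependency-free sketch — the tiny inline set here stands in for the full list you would get from stopwords.words('english') after nltk.download('stopwords'):

```python
# Stopword filtering as in the answer above; the inline set is a
# stand-in (assumption) for NLTK's 2,400-word multilingual list.
stop = {"this", "is", "a"}

def remove_stopwords(sentence, stopset):
    # lowercase, split on whitespace, drop anything in the stopword set
    return [w for w in sentence.lower().split() if w not in stopset]

print(remove_stopwords("This is a foo bar sentence", stop))
# → ['foo', 'bar', 'sentence']
```

Using a set (rather than a list) makes each membership test O(1), which matters when filtering large corpora.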

Python NLTK pos_tag not returning the correct part-of-speech tag

In short: NLTK is not perfect. In fact, no model is perfect. Note: As of NLTK version 3.1, the default pos_tag function is no longer the old MaxEnt English pickle. It is now the perceptron tagger from @Honnibal’s implementation, see nltk.tag.pos_tag >>> import inspect >>> print(inspect.getsource(pos_tag)) def pos_tag(tokens, tagset=None): tagger = PerceptronTagger() return _pos_tag(tokens, tagset, … Read more
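Why can no statistical tagger be perfect? It can only generalize from what it has seen, and English words are tag-ambiguous in context. A toy most-frequent-tag baseline (names and "training" counts made up for illustration) shows the problem in miniature:

```python
# Toy baseline tagger: assign each word its most frequent tag from
# training data. "fish" appears as both VB and NN below, so whichever
# single tag the baseline picks, one context will be mis-tagged --
# the same ambiguity that trips up real taggers like the perceptron.
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    # keep only the single most frequent tag per word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

toy = [[("I", "PRP"), ("can", "MD"), ("fish", "VB")],
       [("the", "DT"), ("fish", "NN"), ("can", "NN")]]
model = train_baseline(toy)
print(model["the"])  # → DT
```

Context-aware models (MaxEnt, the averaged perceptron) reduce but never eliminate such errors.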

Creating a new corpus with NLTK

After some years of figuring out how it works, here’s an updated tutorial on how to create an NLTK corpus from a directory of text files. The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it’s best to use the PlaintextCorpusReader. If … Read more
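The directory layout is the whole contract: one .txt file per document under a corpus root. The sketch below builds such a directory and lists it with plain Python so it stays dependency-free; the commented lines show the actual PlaintextCorpusReader call you would use with NLTK installed:

```python
# Build the directory-of-textfiles layout that
# nltk.corpus.reader.PlaintextCorpusReader expects.
import os
import tempfile

root = tempfile.mkdtemp()
for name, text in [("a.txt", "Hello world."), ("b.txt", "Second file.")]:
    with open(os.path.join(root, name), "w") as f:
        f.write(text)

# With NLTK you would then read it back as a corpus:
#   from nltk.corpus.reader import PlaintextCorpusReader
#   corpus = PlaintextCorpusReader(root, r".*\.txt")
#   corpus.fileids(), corpus.words("a.txt"), corpus.sents()
fileids = sorted(f for f in os.listdir(root) if f.endswith(".txt"))
print(fileids)
# → ['a.txt', 'b.txt']
```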

How do I download NLTK data?

TL;DR To download a particular dataset/models, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use: $ python3 >>> import nltk >>> nltk.download('punkt') If you’re unsure of which data/model you need, you can start out with the basic list of data + models with: >>> import nltk >>> nltk.download('popular') … Read more

How to check if a word is an English word with Python?

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There’s a tutorial, or you could just dive straight in: >>> import enchant >>> d = enchant.Dict("en_US") >>> d.check("Hello") True >>> d.check("Helo") False >>> d.suggest("Helo") ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"] >>> PyEnchant comes with a … Read more
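At its simplest, "is this an English word?" is set membership in a word list; that is what PyEnchant's d.check() does under the hood, with real dictionaries plus suggestion logic on top. A dependency-free sketch — the tiny inline set is a stand-in (assumption) for a real word list such as /usr/share/dict/words:

```python
# Simplest word check: membership in a vocabulary set. PyEnchant's
# Dict.check() adds real dictionaries and d.suggest() for corrections.
words = {"hello", "world", "python"}

def is_english_word(word, vocabulary):
    # case-insensitive lookup, like enchant's en_US dictionary
    return word.lower() in vocabulary

print(is_english_word("Hello", words))  # → True
print(is_english_word("Helo", words))   # → False
```

The word-list approach gives you check() but not suggest(); for corrections you need edit-distance logic or a library like PyEnchant.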

How to use Stanford Parser in NLTK using Python

Note that this answer applies to NLTK v 3.0, and not to more recent versions. Sure, try the following in Python: import os from nltk.parse import stanford os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars' os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars' parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz") sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?")) print(sentences) # GUI for line in sentences: … Read more
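The NLTK 3.0 wrapper locates the Stanford jars via two environment variables, so the setup step runs fine before the jars exist. A sketch of just that setup — the paths are placeholders (assumption: point them at wherever you unpacked the Stanford Parser):

```python
# Environment setup the NLTK 3.0 StanfordParser wrapper reads;
# both variables usually point at the same jar directory.
import os

os.environ["STANFORD_PARSER"] = "/path/to/stanford/jars"
os.environ["STANFORD_MODELS"] = "/path/to/stanford/jars"

# With the jars and model in place you would then do:
#   from nltk.parse import stanford
#   parser = stanford.StanfordParser(
#       model_path="/location/of/the/englishPCFG.ser.gz")
#   for tree in next(parser.raw_parse_sents(["What is your name?"])):
#       tree.draw()  # opens the tree in a GUI window
print(os.environ["STANFORD_PARSER"])
```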