Ordinal numbers replacement

Here’s a terse solution taken from Gareth on codegolf: ordinal = lambda n: "%d%s" % (n, "tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4]). It works on any number: print([ordinal(n) for n in range(1, 32)]) ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', '13th', '14th', '15th', '16th', '17th', '18th', '19th', '20th', '21st', '22nd', '23rd', '24th', '25th', '26th', '27th', '28th', '29th', '30th', … Read more
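The trick is that "tsnrhtdd" interleaves the four suffixes, so slicing every 4th character from offset k recovers one of them; the boolean products compute k, falling back to 0 ("th") for 11–13 and for unit digits above 3. A sketch with the index logic unpacked:

```python
# The golfed one-liner, written out. "tsnrhtdd" interleaves the four
# suffixes: [0::4] -> "th", [1::4] -> "st", [2::4] -> "nd", [3::4] -> "rd".
def ordinal(n):
    tens = n // 10 % 10              # tens digit, to catch 11, 12, 13
    unit = n % 10
    if tens != 1 and unit in (1, 2, 3):
        suffix = "tsnrhtdd"[unit::4]   # "st", "nd" or "rd"
    else:
        suffix = "th"
    return "%d%s" % (n, suffix)

print([ordinal(n) for n in (1, 2, 3, 4, 11, 21, 112)])
# → ['1st', '2nd', '3rd', '4th', '11th', '21st', '112th']
```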

Stopword removal with NLTK

There is a built-in stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al.), see http://nltk.org/book/ch02.html >>> from nltk import word_tokenize >>> from nltk.corpus import stopwords >>> stop = set(stopwords.words('english')) >>> sentence = "this is a foo bar sentence" >>> print([i for i in sentence.lower().split() if i not in stop]) … Read more
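The filtering itself is a plain set-membership comprehension. A minimal, dependency-free sketch — the tiny inline set here stands in for the full list you would get from stopwords.words('english') after nltk.download('stopwords'):

```python
# Stopword filtering as in the answer above; the inline set is a
# stand-in (assumption) for NLTK's 2,400-word multilingual list.
stop = {"this", "is", "a"}

def remove_stopwords(sentence, stopset):
    # lowercase, split on whitespace, drop anything in the stopword set
    return [w for w in sentence.lower().split() if w not in stopset]

print(remove_stopwords("This is a foo bar sentence", stop))
# → ['foo', 'bar', 'sentence']
```

Using a set (rather than a list) makes each membership test O(1), which matters when filtering large corpora.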

Python NLTK pos_tag not returning the correct part-of-speech tag

In short: NLTK is not perfect. In fact, no model is perfect. Note: As of NLTK version 3.1, the default pos_tag function is no longer the old MaxEnt English pickle. It is now the perceptron tagger from @Honnibal’s implementation, see nltk.tag.pos_tag >>> import inspect >>> print(inspect.getsource(pos_tag)) def pos_tag(tokens, tagset=None): tagger = PerceptronTagger() return _pos_tag(tokens, tagset, … Read more
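Why can no statistical tagger be perfect? It can only generalize from what it has seen, and English words are tag-ambiguous in context. A toy most-frequent-tag baseline (names and "training" counts made up for illustration) shows the problem in miniature:

```python
# Toy baseline tagger: assign each word its most frequent tag from
# training data. "fish" appears as both VB and NN below, so whichever
# single tag the baseline picks, one context will be mis-tagged --
# the same ambiguity that trips up real taggers like the perceptron.
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    # keep only the single most frequent tag per word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

toy = [[("I", "PRP"), ("can", "MD"), ("fish", "VB")],
       [("the", "DT"), ("fish", "NN"), ("can", "NN")]]
model = train_baseline(toy)
print(model["the"])  # → DT
```

Context-aware models (MaxEnt, the averaged perceptron) reduce but never eliminate such errors.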

Creating a new corpus with NLTK

After some years of figuring out how it works, here’s an updated tutorial on how to create an NLTK corpus from a directory of text files. The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it’s best to use the PlaintextCorpusReader. If … Read more
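The directory layout is the whole contract: one .txt file per document under a corpus root. The sketch below builds such a directory and lists it with plain Python so it stays dependency-free; the commented lines show the actual PlaintextCorpusReader call you would use with NLTK installed:

```python
# Build the directory-of-textfiles layout that
# nltk.corpus.reader.PlaintextCorpusReader expects.
import os
import tempfile

root = tempfile.mkdtemp()
for name, text in [("a.txt", "Hello world."), ("b.txt", "Second file.")]:
    with open(os.path.join(root, name), "w") as f:
        f.write(text)

# With NLTK you would then read it back as a corpus:
#   from nltk.corpus.reader import PlaintextCorpusReader
#   corpus = PlaintextCorpusReader(root, r".*\.txt")
#   corpus.fileids(), corpus.words("a.txt"), corpus.sents()
fileids = sorted(f for f in os.listdir(root) if f.endswith(".txt"))
print(fileids)
# → ['a.txt', 'b.txt']
```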

How do I download NLTK data?

TL;DR To download a particular dataset/models, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use: $ python3 >>> import nltk >>> nltk.download('punkt') If you’re unsure of which data/model you need, you can start out with the basic list of data + models with: >>> import nltk >>> nltk.download('popular') … Read more

How to check if a word is an English word with Python?

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There’s a tutorial, or you could just dive straight in: >>> import enchant >>> d = enchant.Dict("en_US") >>> d.check("Hello") True >>> d.check("Helo") False >>> d.suggest("Helo") ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"] >>> PyEnchant comes with a … Read more
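At its simplest, "is this an English word?" is set membership in a word list; that is what PyEnchant's d.check() does under the hood, with real dictionaries plus suggestion logic on top. A dependency-free sketch — the tiny inline set is a stand-in (assumption) for a real word list such as /usr/share/dict/words:

```python
# Simplest word check: membership in a vocabulary set. PyEnchant's
# Dict.check() adds real dictionaries and d.suggest() for corrections.
words = {"hello", "world", "python"}

def is_english_word(word, vocabulary):
    # case-insensitive lookup, like enchant's en_US dictionary
    return word.lower() in vocabulary

print(is_english_word("Hello", words))  # → True
print(is_english_word("Helo", words))   # → False
```

The word-list approach gives you check() but not suggest(); for corrections you need edit-distance logic or a library like PyEnchant.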

How to use Stanford Parser in NLTK using Python

Note that this answer applies to NLTK v 3.0, and not to more recent versions. Sure, try the following in Python: import os from nltk.parse import stanford os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars' os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars' parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz") sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?")) print(sentences) # GUI for line in sentences: … Read more
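The NLTK 3.0 wrapper locates the Stanford jars via two environment variables, so the setup step runs fine before the jars exist. A sketch of just that setup — the paths are placeholders (assumption: point them at wherever you unpacked the Stanford Parser):

```python
# Environment setup the NLTK 3.0 StanfordParser wrapper reads;
# both variables usually point at the same jar directory.
import os

os.environ["STANFORD_PARSER"] = "/path/to/stanford/jars"
os.environ["STANFORD_MODELS"] = "/path/to/stanford/jars"

# With the jars and model in place you would then do:
#   from nltk.parse import stanford
#   parser = stanford.StanfordParser(
#       model_path="/location/of/the/englishPCFG.ser.gz")
#   for tree in next(parser.raw_parse_sents(["What is your name?"])):
#       tree.draw()  # opens the tree in a GUI window
print(os.environ["STANFORD_PARSER"])
```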