How to get rid of punctuation using NLTK tokenizer?

Take a look at the other tokenizing options that NLTK provides in its nltk.tokenize module. For example, you can define a tokenizer that picks out sequences of word characters (letters, digits, and underscores) as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

# Each match of the pattern becomes a token; punctuation and whitespace are discarded
tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!'))

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
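
Note that the output above splits "Eighty-seven" into two tokens, and the same pattern would also split contractions such as "Don't". If that is not what you want, a slightly richer pattern can keep them together. The sketch below uses only the standard library's re module, on the assumption that RegexpTokenizer in its default mode collects pattern matches much like re.findall does. The helper name tokenize_words and the pattern WORD_PATTERN are hypothetical, not part of NLTK.

```python
import re

# Hypothetical pattern (not from NLTK): a run of word characters, optionally
# followed by more runs joined by a single hyphen or apostrophe.
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def tokenize_words(text):
    """Return word tokens from text, discarding standalone punctuation."""
    return WORD_PATTERN.findall(text)


# The plain \w+ pattern from the answer, applied with re.findall.
simple = re.findall(r"\w+", "Eighty-seven miles to go, yet.  Onward!")

# The richer pattern keeps hyphenated words and contractions whole.
kept = tokenize_words("Eighty-seven miles to go, yet.  Don't stop!")

print(simple)
print(kept)
```

Because the pattern contains only a non-capturing group, the same pattern string should also be usable as the argument to RegexpTokenizer, though that is worth checking against your installed NLTK version.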
