nltk - w3toppers.com

re.sub erroring with “Expected string or bytes-like object”

As you stated in the comments, some of the values appear to be floats, not strings. You need to convert them to strings before passing them to re.sub. The simplest way is to replace location with str(location) in the re.sub call. It does no harm even when the value is already a str. letters_only …
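A minimal sketch of the str() conversion described above. The helper name keep_letters, the regex pattern, and the sample values are assumptions made for illustration; they are not taken from the original question.

```python
import re

def keep_letters(value):
    """Replace every non-letter character with a space."""
    # str() makes the call safe for floats (including NaN) and any other
    # non-string value; it is a no-op for values that are already str.
    return re.sub(r"[^a-zA-Z]", " ", str(value))

# Hypothetical column values: one string and two floats.
locations = ["New York, NY", 3.5, float("nan")]
cleaned = [keep_letters(loc) for loc in locations]
```

Note that a NaN float becomes the literal string "nan" after str(), so missing values may need to be filtered out or filled before cleaning.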

nltk NaiveBayesClassifier training for sentiment analysis

You need to change your data structure. Here is your train list as it currently stands: >>> train = [('I love this sandwich.', 'pos'), ('This is an amazing place!', 'pos'), ('I feel very good about these beers.', 'pos'), ('This is my best work.', 'pos'), ("What an awesome view", 'pos'), ('I do not like this restaurant', …
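As a hedged illustration of what a changed structure can look like: nltk.NaiveBayesClassifier.train accepts a list of (feature dict, label) pairs, so one option is to turn each sentence into a dict of word-presence features. The helper word_features, the "contains(...)" key format, and the third training sentence are assumptions for this sketch, not the original answer's code.

```python
# Two sentences from the question plus one hypothetical negative example.
train = [
    ("I love this sandwich.", "pos"),
    ("This is an amazing place!", "pos"),
    ("This was a terrible meal", "neg"),  # hypothetical, for illustration only
]

def word_features(sentence, vocabulary):
    """Map a sentence to {feature name: bool} over a fixed vocabulary."""
    tokens = set(sentence.lower().split())
    return {f"contains({word})": (word in tokens) for word in sorted(vocabulary)}

# Vocabulary: every whitespace-separated token seen in the training sentences.
vocabulary = {word for text, _ in train for word in text.lower().split()}

# List of (feature dict, label) pairs.
feature_sets = [(word_features(text, vocabulary), label) for text, label in train]
```

With NLTK installed, feature_sets could then be passed to nltk.NaiveBayesClassifier.train(feature_sets).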

Python: tf-idf-cosine: to find document similarity

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer: >>> from sklearn.feature_extraction.text import TfidfVectorizer >>> from sklearn.datasets import fetch_20newsgroups >>> twenty = fetch_20newsgroups() >>> tfidf = TfidfVectorizer().fit_transform(twenty.data) >>> tfidf <11314x130088 sparse matrix of type '<type 'numpy.float64'>' with 1787553 …
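To show why row-wise Euclidean normalization matters for similarity, here is a small hand-rolled sketch in plain Python rather than scikit-learn. Once every row has unit length, the cosine similarity of two documents is just their dot product. The whitespace tokenizer, the smoothed IDF formula, and the three sample documents are assumptions of this sketch and will not reproduce TfidfVectorizer's output exactly.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times a smoothed IDF (an assumption chosen for this sketch).
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

def cosine(row_a, row_b):
    """Dot product of two unit-length sparse rows, i.e. their cosine similarity."""
    return sum(weight * row_b.get(term, 0.0) for term, weight in row_a.items())

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply",
]
rows = tfidf_rows(docs)
similarities = [cosine(rows[0], row) for row in rows]
```

Here similarities[0] compares the first document with itself, and the third document shares no terms with the first, so its score is zero.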

wordnet lemmatization and pos tagging in python

First of all, you can use nltk.pos_tag() directly without training it. The function will load a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER: nltk.tag._POS_TAGGER >>> 'taggers/maxent_treebank_pos_tagger/english.pickle' As it was trained with the Treebank corpus, it also uses the Treebank tag set. The following function would map the Treebank tags …
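The excerpt is cut off before its mapping function, so the following is only one plausible sketch of such a mapping, not the original code. It relies on the first letter of each Treebank tag and uses the single-letter codes that WordNet uses for parts of speech (the same values as the ADJ, VERB, NOUN and ADV constants in nltk.corpus.wordnet). The function name and the default argument are assumptions.

```python
# Single-letter WordNet POS codes, written out so the sketch runs without NLTK.
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"

def treebank_to_wordnet(tag, default=None):
    """Map a Penn Treebank tag to a WordNet POS code, or return default."""
    if tag.startswith("J"):   # JJ, JJR, JJS
        return WN_ADJ
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return WN_VERB
    if tag.startswith("N"):   # NN, NNS, NNP, NNPS
        return WN_NOUN
    if tag.startswith("R"):   # RB, RBR, RBS
        return WN_ADV
    return default            # e.g. DT, IN, CC have no WordNet counterpart

# Hypothetical tagger output for a short phrase.
tagged = [("dogs", "NNS"), ("ran", "VBD"), ("quickly", "RB"), ("the", "DT")]
mapped = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
```

The mapped code can then be given to a WordNet lemmatizer as its part-of-speech argument, with a fallback such as the noun code when the result is None.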

Convert words between verb/noun/adjective forms

This is more of a heuristic approach. I have just coded it, so apologies for the style. It uses derivationally_related_forms() from WordNet. I have implemented nounify; I guess verbify works analogously. From what I've tested, it works pretty well: from nltk.corpus import wordnet as wn def nounify(verb_word): """ Transform a verb to the closest noun: die …

n-grams in python, four, five, six grams?

Other users have given great native Python answers. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ngram module in nltk that people seldom use. It's not because it's hard to read ngrams, but training a model based on …
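For reference, a minimal pure-Python sketch of n-gram extraction for any n, built on zip over shifted slices. The function name and the sample sentence are assumptions for illustration. NLTK's own nltk.ngrams(sequence, n) yields tuples of the same shape, so the two can be swapped where NLTK is available.

```python
def ngrams(tokens, n):
    """Return all n-grams over a token sequence as a list of tuples."""
    tokens = list(tokens)
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "this is a foo bar sentence".split()
fourgrams = ngrams(sentence, 4)
fivegrams = ngrams(sentence, 5)
sixgrams = ngrams(sentence, 6)
```

When n is larger than the number of tokens, the result is an empty list.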

Classification using movie review corpus in NLTK/Python

Yes, the tutorial in chapter 6 is aimed at giving students basic knowledge, and from there they should build on it by exploring what's available in NLTK and what's not. So let's go through the problems one at a time. Firstly, the way to get 'pos'/'neg' documents through the directory is most probably …

pip issue installing almost any library

I found it sufficient to specify the pypi host as trusted. Example: pip install --trusted-host pypi.python.org pytest-xdist pip install --trusted-host pypi.python.org --upgrade pip This solved the following error: Could not fetch URL https://pypi.python.org/simple/pytest-cov/: There was a problem confirming the ssl certificate: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600) - skipping Could not find a version that …

How to config nltk data directory from code?

Just change the items of nltk.data.path; it's a simple list.

Why is my NLTK function slow when processing the DataFrame?

Your original nlkt() loops through each row 3 times. def nlkt(val): val=repr(val) clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')] nopunc = [char for char in str(clean_txt) if char not in string.punctuation] nonum = [char for char in nopunc if not char.isdigit()] words_string = ''.join(nonum) return words_string Also, each time you're …
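One way to cut the repeated work, sketched under stated assumptions: build the stop-word collection once as a set, and delete punctuation and digits with a single str.translate call per word. STOPWORDS below is a small hypothetical stand-in for set(stopwords.words('english')), and clean_text is not a drop-in replacement for nlkt(): it joins the surviving words with spaces instead of reproducing the repr()-based output.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')), built once
# so each membership test is a constant-time set lookup.
STOPWORDS = frozenset({"the", "a", "an", "is", "of", "and", "to", "in"})

# Translation table that deletes punctuation and digits in one pass.
STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def clean_text(val):
    """Drop stop words, then strip punctuation and digits from what remains."""
    words = (w for w in str(val).split() if w.lower() not in STOPWORDS)
    stripped = (w.translate(STRIP_TABLE) for w in words)
    # Skip words that became empty, such as pure numbers.
    return " ".join(w for w in stripped if w)

result = clean_text("The price of 3 apples is $4.50, in total!")
```

On a DataFrame this could be applied with something like df["text"].apply(clean_text), where the column name is hypothetical.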