How to tweak the NLTK sentence tokenizer

You need to supply a list of abbreviations to the tokenizer, like so:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
```

`sentences` is now: `['is THAT what you mean, Mrs. Hussey?']`

Update: … Read more

OpenAI GPT-3 API: How do I make sure answers are from a customized (fine-tuning) dataset?

Semantic search example

The following is an example of semantic search based on embeddings using the OpenAI API.

Wrong goal: the OpenAI API should answer from the fine-tuning dataset if the prompt is similar to one from the fine-tuning dataset.

This is completely wrong logic. Forget about fine-tuning. As stated in the official OpenAI documentation: Fine-tuning … Read more
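To make the idea concrete, here is a minimal sketch of embedding-based semantic search. The document texts and embedding vectors below are made up for illustration; in practice each vector would come from an embeddings endpoint (e.g. OpenAI's), and you would rank your stored documents by cosine similarity to the query's embedding:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for vectors returned by an embeddings API.
docs = {
    "Our refund policy allows returns within 30 days.": [0.9, 0.1, 0.0],
    "Standard shipping takes 3-5 business days.": [0.1, 0.9, 0.1],
}
query_embedding = [0.85, 0.15, 0.05]  # e.g. "how do I get my money back?"

# Pick the stored document whose embedding is closest to the query's.
best_doc = max(docs, key=lambda d: cosine_similarity(query_embedding, docs[d]))
print(best_doc)
```

The answer is then produced from `best_doc` (quoted or passed to a completion prompt as context), rather than hoping a fine-tuned model has memorized the dataset.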

semantic similarity between sentences

Salma, I'm afraid this is not the right forum for your question, as it's not directly related to programming. I recommend that you ask your question again on the Corpora List; you may also want to search their archives first. Apart from that, your question is not precise enough, and I'll explain what I mean by … Read more

Fast n-gram calculation

Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without loss of generality. I also assume you start with a list of tokens, represented by strings. What you can easily do is write the n-gram extraction yourself.

```python
def ngrams(tokens, MIN_N, MAX_N):
    n_tokens = len(tokens)
    for i in
```
… Read more
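The truncated function above can be completed along these lines. This is a sketch under the stated assumptions (word-level n-grams over a list of string tokens), not necessarily the original answer's exact code:

```python
def ngrams(tokens, min_n, max_n):
    """Yield every contiguous n-gram of length min_n..max_n (inclusive)
    from a list of string tokens, as tuples."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        # Slice end runs from the shortest to the longest n-gram that
        # still fits inside the token list.
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            yield tuple(tokens[i:j])

print(list(ngrams(["a", "b", "c"], 1, 2)))
# [('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
```

Using a generator keeps memory flat even for long token lists, since n-grams are produced one at a time.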

Implementing Bag-of-Words Naive-Bayes classifier in NLTK

scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though. As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here’s a somewhat complicated one that does … Read more
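A minimal illustration of the suggested setup, assuming scikit-learn is installed. The toy texts and labels here are invented; the answer's actual code goes through NLTK's `SklearnClassifier` wrapper, which takes feature dicts rather than raw strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration.
texts = [
    "great movie loved it",
    "terrible plot awful acting",
    "wonderful fun film",
    "boring and awful",
]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["awful boring movie"])[0])  # 'neg'
```

Multinomial naive Bayes is the right variant here because it models word *counts*, which is exactly what a bag-of-words representation provides.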

What is NLTK POS tagger asking me to download?

For NLTK versions v3.2 and higher, please use:

```python
>>> import nltk
>>> nltk.__version__
'3.2.1'
>>> nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/alvas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!
True
```

For NLTK versions using the old MaxEnt model, i.e. v3.1 and below, please use:

```python
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
[nltk_data] Downloading package maxent_treebank_pos_tagger to [nltk_data]
```
… Read more

NLTK WordNet Lemmatizer: Shouldn’t it lemmatize all inflections of a word?

The WordNet lemmatizer does take the POS tag into account, but it doesn't magically determine it:

```python
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'
```

Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you're passing it the noun "loving" (as in "sweet loving").
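Because the lemmatizer defaults to noun, a common pattern is to POS-tag the text first and map Penn Treebank tags to the single-letter tags WordNet expects. `penn_to_wordnet_pos` is a hypothetical helper name, but the mapping itself is the standard one:

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG', 'JJ') to the POS letter the
    WordNet lemmatizer expects: 'n' (noun), 'v' (verb), 'a' (adjective),
    'r' (adverb). Defaults to noun, matching the lemmatizer's own default."""
    if penn_tag.startswith('V'):
        return 'v'
    if penn_tag.startswith('J'):
        return 'a'
    if penn_tag.startswith('R'):
        return 'r'
    return 'n'

print(penn_to_wordnet_pos('VBG'))  # 'v'
```

With this in place, `WordNetLemmatizer().lemmatize('loving', penn_to_wordnet_pos('VBG'))` returns 'love', as in the verb example above.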