OpenAI GPT-3 API: How do I make sure answers are from a customized (fine-tuning) dataset?

Semantic search example: The following is an example of semantic search based on embeddings using the OpenAI API. Wrong goal: the OpenAI API should answer from the fine-tuning dataset if the prompt is similar to one from the fine-tuning dataset. This logic is completely wrong. Forget about fine-tuning. As stated in the official OpenAI documentation: Fine-tuning …
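To make the embeddings idea concrete, here is a minimal sketch of the ranking step only. It assumes the embedding vectors have already been obtained from some embedding model; the three-dimensional vectors and the document names below are made up for illustration and are not real API output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)

# Hypothetical toy embeddings; a real model returns far longer vectors.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api limits": [0.0, 0.2, 0.9],
}
QUERY = [0.8, 0.2, 0.1]  # assumed embedding of the user's question

best_score, best_doc = rank(QUERY, DOCS)[0]
```

The best-matching text would then be passed to the model as context for the answer, which is why no fine-tuning is involved in this approach.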

What is the difference between lemmatization vs stemming?

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope …
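The contrast can be illustrated with a toy sketch. The suffix list and the lemma table below are hypothetical and tiny; a real stemmer (such as Porter's) has many more rules, and a real lemmatizer uses a full vocabulary and usually part-of-speech information, which this sketch ignores.

```python
# Toy stemmer: chop off the first matching suffix, nothing more.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, far from complete

def crude_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lemmatizer: look the word up in a (hypothetical) dictionary of lemmas.
LEMMAS = {"saw": "see", "studies": "study", "running": "run", "better": "good"}

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ("studies", "running", "saw"):
    print(w, "->", crude_stem(w), "(stem) vs", lemmatize(w), "(lemma)")
```

The stems need not be real words ("studi", "runn"), while the lemmas are dictionary forms; that difference is the "flavor" the quoted passage refers to.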

Anyone know of some good Word Sense Disambiguation software? [closed]

My list is not exhaustive, and searching for more will surely serve your purposes better. For software, here’s a short list; remember to CITE the relevant sources!
GWSD: Unsupervised Graph-based Word Sense Disambiguation http://lit.csci.unt.edu/~rada/downloads/GWSD/GWSD.1.0.tar.gz
SenseLearner: All-Words Word Sense Disambiguation Tool http://lit.csci.unt.edu/~rada/downloads/senselearner/SenseLearner2.0.tar.gz
KYOTO UKB graph-based WSD http://ixa2.si.ehu.es/ukb/
pyWSD: Python Implementation of Simple WSD algorithms https://github.com/alvations/pywsd
…

How do you implement a “Did you mean”? [duplicate]

Actually, what Google does is very much non-trivial and, at first, counter-intuitive. They don’t do anything like checking against a dictionary; rather, they make use of statistics to identify “similar” queries that returned more results than your query. The exact algorithm is of course not known. There are different sub-problems to solve here, …
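As a rough illustration of the statistical idea (and explicitly not Google's algorithm, which is not public), the sketch below suggests a logged query that is close to the user's query by edit distance and returned more results. The query log, its counts, and the distance threshold are all assumptions made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def did_you_mean(query, result_counts, max_distance=2):
    """Return the similar query with the most results, or None."""
    own = result_counts.get(query, 0)
    best = None
    for candidate, count in result_counts.items():
        if candidate == query or count <= own:
            continue  # only suggest queries that did better than this one
        if edit_distance(query, candidate) <= max_distance:
            if best is None or count > result_counts[best]:
                best = candidate
    return best

# Hypothetical log: query -> number of results it returned.
COUNTS = {"britney spears": 50000, "brittany spears": 300, "britney speers": 40}

suggestion = did_you_mean("britny spears", COUNTS)
```

Scanning the whole log per query is only workable for a toy; at scale the candidate lookup would need an index.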

Stemmers vs Lemmatizers

Q1: “[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English” Yes. Stemmers are much simpler, smaller and usually faster than lemmatizers, and for many applications their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction …

Detecting syllables in a word

Read about the TeX approach to this problem for the purposes of hyphenation. See especially Frank Liang’s thesis Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it is supplemented by a small exceptions dictionary for cases where the algorithm does not work.
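The pattern idea can be sketched in a few lines. In a Liang-style pattern, a digit between letters scores that gap: odd values allow a break, even values forbid one, and the highest value from any matching pattern wins. The handful of patterns below is hand-picked for a single word and is only an illustration; it is not the real TeX pattern file, and the sketch omits the exceptions dictionary and TeX's minimum fragment lengths.

```python
def parse_pattern(pattern):
    """Split 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]); points[k] scores the gap before letter k."""
    letters, points = [], [0]
    for ch in pattern:
        if ch.isdigit():
            points[-1] = int(ch)
        else:
            letters.append(ch)
            points.append(0)
    return "".join(letters), points

# Illustrative subset only (assumption), chosen to cover "hyphenation".
RAW_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
PATTERNS = dict(parse_pattern(p) for p in RAW_PATTERNS)

def hyphenate(word, patterns):
    padded = "." + word.lower() + "."   # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)    # scores[n] is the gap before padded[n]
    for i in range(len(padded)):
        for j in range(i + 1, len(padded) + 1):
            points = patterns.get(padded[i:j])
            if points is not None:
                for k, p in enumerate(points):
                    scores[i + k] = max(scores[i + k], p)
    pieces, start = [], 0
    for m in range(1, len(word)):       # gap before word[m] is scores[m + 1]
        if scores[m + 1] % 2 == 1:      # odd score: a break is allowed here
            pieces.append(word[start:m])
            start = m
    pieces.append(word[start:])
    return pieces

parts = hyphenate("hyphenation", PATTERNS)
```

Hyphenation points are not always syllable boundaries, so for syllable detection the output is a starting approximation rather than a final answer.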

What are some simple NLP projects that a CS undergrad can try implementing? [closed]

There are plenty of them. Here is a list of different NLP problems: spam detection; text genre categorization (news, fiction, science paper); finding similar texts (for example, searching for similar articles); finding out something about the author (genre, native speaker/non-native speaker); creating an automatic grader for students’ work; checking text for plagiarism; creating an application that looks for grammatical errors …