How does Apple find dates, times and addresses in emails?

They likely use Information Extraction techniques for this. Here is a demo of Stanford’s SUTime tool: http://nlp.stanford.edu:8080/sutime/process You would extract attributes about n-grams (consecutive words) in a document: numberOfLetters, numberOfSymbols, length, previousWord, nextWord, nextWordNumberOfSymbols, … Then you use a classification algorithm and feed it positive and negative examples, one row per observation with columns such as: Observation, nLetters, nSymbols, length, prevWord, nextWord, isPartOfDate, … Read more
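As a rough illustration (not Apple's actual pipeline), here is a minimal sketch of that feature-extraction-plus-classifier idea using scikit-learn; the feature names, toy sentences, and labels are made up for the example.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def ngram_features(tokens, i):
    # Attributes of the token at position i, in the spirit of the columns above.
    tok = tokens[i]
    return {
        "numberOfLetters": sum(c.isalpha() for c in tok),
        "numberOfSymbols": sum(not c.isalnum() for c in tok),
        "length": len(tok),
        "previousWord": tokens[i - 1] if i > 0 else "<BOS>",
        "nextWord": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

# Toy labelled examples: is each token part of a date expression?
sentences = [
    (["Meet", "me", "on", "March", "3rd", "at", "noon"], [0, 0, 0, 1, 1, 0, 0]),
    (["The", "report", "is", "due", "tomorrow"], [0, 0, 0, 0, 1]),
]
X = [ngram_features(toks, i) for toks, labels in sentences for i in range(len(toks))]
y = [lab for _, labels in sentences for lab in labels]

# DictVectorizer one-hot encodes the string features and keeps the numeric ones.
clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict([ngram_features(["See", "you", "on", "Friday"], 3)]))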

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you, so I am going to cover different topics; bear with me ;). Class weights: the weights from the class_weight parameter are used to train the classifier. They are not used … Read more
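For the headline question itself (multiclass precision, recall, accuracy and F1 in scikit-learn), a minimal sketch with made-up labels might look like this; the key choice is the average parameter of the metric functions.

from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

# Toy multiclass ground truth and predictions (three classes: 0, 1, 2).
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print("accuracy:", accuracy_score(y_true, y_pred))

# 'macro' averages the per-class scores without weighting by support;
# 'weighted' weights each class by its number of true instances.
for avg in ("macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(avg, "precision:", p, "recall:", r, "f1:", f1)

# Per-class breakdown in one call.
print(classification_report(y_true, y_pred))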

What is the difference between lemmatization vs stemming?

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope … Read more
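To make the contrast concrete, here is a small sketch with NLTK's PorterStemmer and WordNetLemmatizer (assuming the wordnet data has been downloaded); the stemmer chops endings heuristically, while the lemmatizer maps to dictionary forms and can take part of speech into account.

from nltk.stem import PorterStemmer, WordNetLemmatizer
# The lemmatizer needs the 'wordnet' corpus: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["studies", "wolves", "running", "saw"]:
    print(f"{w:10s} stem={stemmer.stem(w):10s} "
          f"lemma(n)={lemmatizer.lemmatize(w):10s} "
          f"lemma(v)={lemmatizer.lemmatize(w, pos='v')}")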

Pandas dataframe groupby text value that occurs in two columns

This seems like a graph problem. You could try to use networkx: import networkx as nx G = nx.from_pandas_edgelist(df, 'v1', 'v2') clusters = list(nx.connected_components(G)) output: [{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'}, {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'}, {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'}, {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}] As graph: Small … Read more
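To then do the actual pandas groupby, one possible sketch is to map each word to the index of its connected component and group on that; only the 'v1'/'v2' columns come from the excerpt, while the toy df, the 'count' column, and the aggregation are assumptions for illustration.

import networkx as nx
import pandas as pd

# Toy edge list in the same shape as the excerpt's df (columns 'v1' and 'v2').
df = pd.DataFrame({"v1": ["delay", "increase", "analyze", "be"],
                   "v2": ["increase", "decrease", "assay", "belong"],
                   "count": [3, 1, 2, 5]})

G = nx.from_pandas_edgelist(df, "v1", "v2")
clusters = list(nx.connected_components(G))

# Map every word to the index of its connected component...
word_to_cluster = {w: i for i, comp in enumerate(clusters) for w in comp}

# ...so rows whose words fall in the same component are grouped together.
grouped = df.groupby(df["v1"].map(word_to_cluster))["count"].sum()
print(grouped)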

Using my own corpus instead of movie_reviews corpus for Classification in NLTK

If you have your data in exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to “hack” your way through: 1. Put your corpus directory into the directory where your nltk_data is saved. First check where your nltk_data is saved: >>> import nltk >>> nltk.data.find('corpora/movie_reviews') FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews') Then move your directory to where … Read more
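An alternative to moving directories around (a sketch, not part of the excerpt above; the path and file layout here are placeholders) is to point a CategorizedPlaintextCorpusReader at your own corpus directory, laid out like movie_reviews with one subdirectory per category.

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Placeholder root directory, e.g. my_corpus/pos/*.txt and my_corpus/neg/*.txt.
reader = CategorizedPlaintextCorpusReader(
    "/path/to/my_corpus",       # root directory (placeholder)
    r".*\.txt",                 # which files to include
    cat_pattern=r"(\w+)/.*",    # category taken from the subdirectory name
)

print(reader.categories())      # e.g. ['neg', 'pos']
documents = [(list(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]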