Extracting dates that are in different formats using regex and sorting them – pandas

I think this is one of the Coursera text mining assignments. You can use regex with str.extract to get the solution. Reading dates.txt into a Series:

    import pandas as pd

    doc = []
    with open('dates.txt') as file:
        for line in file:
            doc.append(line)
    df = pd.Series(doc)

    def date_sorter():
        # Get the dates in the form of words
        one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
        # Get … Read more

Detect text language in R

The textcat package does this. It can detect 74 ‘languages’ (more properly, language/encoding combinations), more with other extensions. Details and examples are in this freely available article: Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52, … Read more
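A minimal sketch of basic textcat usage, assuming the package is installed (the sample strings are illustrative, not from the source):

    library(textcat)

    # textcat() returns the best-matching language/encoding category for each input string
    samples <- c("This is an English sentence.",
                 "Ceci est une phrase en français.",
                 "Dies ist ein deutscher Satz.")
    textcat(samples)
    # e.g. "english" "french" "german"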

Finding 2 & 3 word Phrases Using R TM Package

You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.

    library(tm)
    library(tau)

    tokenize_ngrams <- function(x, n = 3)
      return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))

    texts <- c("This is the first document.",
               "This is the second file.",
               "This is the third text.")
    corpus <- Corpus(VectorSource(texts))
    matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))

Where n in … Read more
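As a hedged follow-up, assuming the corpus and matrix objects built in the excerpt above, you can check which phrases ended up as terms:

    # Assumes `matrix` is the DocumentTermMatrix created above
    inspect(matrix)           # print the document-term counts
    Terms(matrix)             # list the extracted word-sequence terms
    findFreqTerms(matrix, 2)  # terms with an overall frequency of at least 2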