text-mining
Extracting dates that are in different formats using regex and sorting them – pandas
I think this is one of the Coursera text-mining assignments. You can use regex and `str.extract` to get the solution. Read dates.txt into a Series: doc = [] with open('dates.txt') as file: for line in file: doc.append(line) df = pd.Series(doc) def date_sorter(): # Get the dates in the form of words one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})') # Get … Read more
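A self-contained sketch of that first extraction step, with made-up sample lines standing in for dates.txt and `\d{1,2}` substituted for the snippet's `\d{,2}` for clarity:

```python
import pandas as pd

# Illustrative input lines; the real assignment reads them from dates.txt.
doc = ["Lab visit on 23 Mar 2001.\n",
       "Follow-up scheduled Jan 13, 2011.\n",
       "No date on this line.\n"]
df = pd.Series(doc)

# Extract month-name dates such as "23 Mar 2001" or "Jan 13, 2011".
pattern = (r'((?:\d{1,2}\s)?'
           r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
           r'(?:-|\.|\s|,)\s?\d{1,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
dates = df.str.extract(pattern)  # one column; NaN where nothing matched
```

The full solution would add further patterns for numeric formats (e.g. 04/20/2009) before sorting.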
Bigrams instead of single words in term-document matrix using R and RWeka
Inspired by Anthony’s comment, I found out that you can specify the number of threads that the parallel library uses by default (specify it before you call the NGramTokenizer): # Sets the default number of threads to use options(mc.cores=1) Since the NGramTokenizer seems to hang on the parallel::mclapply call, changing the number of threads seems … Read more
Detect text language in R
The textcat package does this. It can detect 74 ‘languages’ (more properly, language/encoding combinations), more with other extensions. Details and examples are in this freely available article: Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52, … Read more
R tm package invalid input in ‘utf8towcs’
None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html). The code, using stringr's str_replace_all, is this simple: usableText = str_replace_all(tweets$text, "[^[:graph:]]", " ")
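A rough Python analogue of the same cleanup, for readers outside R. It approximates R's POSIX `[^[:graph:]]` class (while keeping ordinary spaces) by replacing everything outside printable ASCII with a space; the function name and sample string are illustrative only:

```python
import re

def strip_nongraph(text):
    # [^ -~] matches any character outside space..tilde (printable ASCII),
    # so stray control bytes and mis-encoded characters become spaces.
    return re.sub(r'[^ -~]', ' ', text)
```

Note this is stricter than R's `[:graph:]` in a UTF-8 locale, which treats printable non-ASCII characters as graphical.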
Finding 2 & 3 word Phrases Using R TM Package
You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed it's fairly straightforward. library(tm); library(tau); tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x, method="string", n=n))))) texts <- c("This is the first document.", "This is the second file.", "This is the third text.") corpus <- Corpus(VectorSource(texts)) matrix <- DocumentTermMatrix(corpus, control=list(tokenize=tokenize_ngrams)) Where n in … Read more
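For comparison, the same n-gram idea sketched in plain Python rather than via tau::textcnt; the helper and sample texts are illustrative, and the tokenization (lowercase, strip trailing period, split on whitespace) is deliberately naive:

```python
from collections import Counter

def ngrams(text, n=2):
    # Naive word tokenizer: lowercase, drop a trailing period, split on spaces.
    words = text.lower().rstrip('.').split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

texts = ["This is the first document.",
         "This is the second file.",
         "This is the third text."]

# Count bigrams across all documents, analogous to a term count over n-grams.
counts = Counter(g for t in texts for g in ngrams(t, n=2))
```

A real term-document matrix would keep per-document counts; this collapsed counter just shows what the bigram tokens look like.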