How can I find compound words, remove the spaces between them, and replace them in my corpus?

If all your compound terms are separated only by blanks, you can use gsub:

> x = c("hello World", "good Morning", "good Night")
> y = gsub(pattern = " ", replacement = "", x = x)
> print(y)
[1] "helloWorld"  "goodMorning" "goodNight"  

You can always extend the pattern argument to match more cases. For more on regular expressions in R, see the ?regex help page.
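As an illustration of extending the pattern, here is a small sketch that removes any run of whitespace (not just a single blank) as well as hyphens. The sample strings are made up for this example.

```r
# Hypothetical sample data with mixed separators
x <- c("hello   World", "good\tMorning", "good-Night")

# [[:space:]]+ matches any run of whitespace; the alternation also drops hyphens
y <- gsub(pattern = "[[:space:]]+|-", replacement = "", x = x)
print(y)
```

Note that this still removes every match in every string, so it only suits input that consists of compound terms alone.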

Edit

@user4241750: True, but I only want to do this for particular compound
terms (there are many), not for all the terms in the corpus, since the
corpus contains many other terms

If you know all the compound terms you want to change, you can restrict the replacement to those terms within each docs[[j]]. Say the only terms you want to change are “simple parts” and “good morning”:

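terms.to.change = c("simple parts", "good morning")
for (j in seq_along(docs)) {
  # %in% compares whole elements, so only entries exactly equal to a term match
  positions.to.change = which(docs[[j]] %in% terms.to.change)
  # remove the blanks in the matching entries only
  docs[[j]][positions.to.change] <- gsub(" ", "", docs[[j]][positions.to.change])
}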
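The loop above only changes elements that are exactly equal to one of the terms. If the terms instead appear inside longer strings, something along these lines may work. This is a minimal sketch that assumes the texts are held in a plain character vector; join_compounds and the sample texts are hypothetical names for illustration and are not part of the original answer.

```r
# Replace each listed compound term with its space-free form, wherever it
# occurs inside the texts. All other words and spaces are left untouched.
join_compounds <- function(texts, terms) {
  for (term in terms) {
    joined <- gsub(" ", "", term, fixed = TRUE)
    # fixed = TRUE treats the term as a literal string, not a regex
    texts <- gsub(term, joined, texts, fixed = TRUE)
  }
  texts
}

# Hypothetical sample data
texts <- c("these simple parts fit", "good morning to all", "nothing to change")
terms.to.change <- c("simple parts", "good morning")

result <- join_compounds(texts, terms.to.change)
print(result)
```

Matching is case-sensitive and literal here, so "Good Morning" would not be changed unless it is listed as its own term.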
