Removing HTML tags from a string in R

This can be done with a regular expression and gsub(), one of R's grep family of functions:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This also works when the same string contains multiple HTML tags.

This finds every match of the pattern <.*?> in htmlString and replaces each one with the empty string "". The ? in .*? makes the match non-greedy, so if the string contains multiple tags (e.g., <a> junk </a>), it matches <a> and </a> separately instead of the whole string.
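To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both patterns on the example string from the paragraph above. The variable names (html, nonGreedy, greedy, mixed) are made up for illustration, and cleanFun is repeated so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained.
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

html <- "<a> junk </a>"            # sample input taken from the text

# Non-greedy: <a> and </a> are matched and removed separately.
nonGreedy <- cleanFun(html)
print(nonGreedy)                   # " junk "

# Greedy: <.*> runs from the first < to the last >, so it swallows everything.
greedy <- gsub("<.*>", "", html)
print(greedy)                      # ""

# gsub() is vectorized, so cleanFun also accepts a character vector.
mixed <- cleanFun(c("<b>bold</b>", "no tags here"))
print(mixed)                       # "bold" "no tags here"
```

Note that the surrounding spaces in " junk " are kept. If they are unwanted, wrapping the result in trimws() is one way to remove them.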
