How to remove unicode from string?

I just want to remove the Unicode tag <U+00A6>, which is at the beginning of the string.

Then you do not need gsub. Only one match at the start of the string has to be replaced, so sub with the "^\\s*<U\\+\\w+>\\s*" pattern is enough:

q <- "<U+00A6>  1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)

Pattern details:

  • ^ – start of string
  • \\s* – zero or more whitespace characters
  • <U\\+ – the literal character sequence <U+
  • \\w+ – one or more word characters (letters, digits or underscores)
  • > – a literal >
  • \\s* – zero or more whitespace characters.
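
To see how the anchor and the optional whitespace behave, here is a small sketch that runs the same pattern over a few inputs. Only the first string comes from the question; the other two are made up for illustration.

```r
# Hypothetical inputs; only the first one is from the question.
x <- c("<U+00A6>  1000-66329",   # tag at the very start
       "  <U+00A6>1000-66329",   # whitespace before the tag
       "1000 <U+00A6> 66329")    # tag in the middle: ^ prevents a match

# sub() is vectorised and replaces at most one match per element.
res <- sub("^\\s*<U\\+\\w+>\\s*", "", x)
print(res)
```

The first two elements should come back as 1000-66329, while the third stays unchanged because the tag is not at the start of the string. Note that this only works when the string really contains the literal characters <, U, + and so on. If the string holds the actual U+00A6 character and R merely prints it as <U+00A6>, the pattern will not match.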

If you also need to replace the - with a space, add a |- alternative and use gsub, since several replacements are now expected and the replacement must be a space (same as in akrun’s answer). trimws then strips the leading space left where the tag was:

trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))
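
Splitting that one-liner into two steps may make it clearer what each function contributes. This is only a sketch: the names step1 and step2 are introduced here for illustration.

```r
q <- "<U+00A6>  1000-66329"

# Step 1: the tag at the start and every hyphen are each replaced by one space.
step1 <- gsub("^\\s*<U\\+\\w+>|-", " ", q)

# Step 2: trimws() drops the leading (and any trailing) whitespace.
step2 <- trimws(step1)

print(step1)
print(step2)
```

After step 1 the string still starts with spaces (one from the replaced tag plus the two that followed it), which is why the trimws call is needed to end up with 1000 66329.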

See the R online demo
