Remove ✅, 🔥, ✈, ♛ and other such emojis/images/signs from Java strings

Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter,"");

So:

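  • [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a character class covering all letter (\\p{L}), mark (\\p{M}), numeric (\\p{N}), punctuation (\\p{P}), separator (\\p{Z}), format (\\p{Cf}) and surrogate (\\p{Cs}) characters, plus whitespace such as tabs and newlines (\\s). \\p{L} includes letters from every script, such as Latin, Cyrillic, Kanji, etc. Note that Java matches whole code points, so \\p{Cs} only matches unpaired surrogates; an emoji above U+FFFF such as 🔥 is treated as a single symbol (\\p{So}), which is not on the whitelist, and is therefore removed.
  • The ^ at the start of the character class negates it, so the pattern matches every character that is not on the whitelist, and replaceAll removes those characters.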
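
To check how the whitelist treats a particular character, one could probe single code points. The sketch below is only an illustration: the class CategoryProbe and the method isKept are hypothetical names, not part of the answer above. It also shows a side effect worth knowing: the symbol categories (\\p{S}) are not whitelisted, so ASCII symbols such as + and $ are stripped along with the emoji.

```java
class CategoryProbe {
    // Same negated whitelist as in the answer above.
    static final String CHARACTER_FILTER =
            "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";

    // Hypothetical helper: true if the filter leaves this single code point untouched.
    static boolean isKept(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.replaceAll(CHARACTER_FILTER, "").equals(s);
    }

    public static void main(String[] args) {
        // A letter, a digit, punctuation, two ASCII symbols, U+2708 (airplane), U+1F525 (fire).
        int[] samples = {'a', '7', '#', '+', '$', 0x2708, 0x1F525};
        for (int cp : samples) {
            // Character.getType returns the general category as an int constant,
            // e.g. Character.OTHER_SYMBOL for most emoji.
            System.out.printf("U+%04X type=%d kept=%b%n",
                    cp, Character.getType(cp), isKept(cp));
        }
    }
}
```

If symbols like + or $ need to survive, one option would be to add the specific categories (for example \\p{Sm} or \\p{Sc}) to the whitelist, at the cost of keeping every character in those categories.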

Example:

String str = "hello world _# 皆さん、こんにちは！　私はジョンと申します。🔥";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));
// Output:
//   "hello world _# 皆さん、こんにちは！　私はジョンと申します。"
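
String.replaceAll compiles the regex on every call, so if the filter runs often it may be worth compiling it once. A minimal sketch, assuming a small utility class is acceptable; the names EmojiStripper and strip are made up for illustration, and the emoji are written as \u escapes so the file does not depend on the source encoding.

```java
import java.util.regex.Pattern;

final class EmojiStripper {
    // Compiled once and reused; matches any code point that is NOT on the whitelist.
    private static final Pattern NOT_WHITELISTED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    private EmojiStripper() {
        // Static helper only; no instances.
    }

    // Hypothetical helper: returns the input with all non-whitelisted characters removed.
    static String strip(String input) {
        return NOT_WHITELISTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // "\uD83D\uDD25" is the surrogate pair for U+1F525 (fire), "\u2708" is the airplane sign.
        System.out.println(strip("hello world _# \uD83D\uDD25 \u2708"));
    }
}
```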

If you need more information, check out the Java documentation for java.util.regex.Pattern, which lists the Unicode categories and character classes that regexes support.
