Issue with below snippet on boundary matchers regex (\b)

You need to create an alternation group out of the set with

String.join("|", toDelete)

and use it as follows:

line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");

The pattern will look like

\b(?:end|something)\b

Here, (?:...) is a non-capturing group: it groups the alternatives so that both \b anchors apply to every word, without storing the match in a capture group (you do not need a capture since you remove the matches).
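To illustrate why the group matters, here is a small sketch comparing the pattern with and without (?:...). The class name GroupingDemo, the helper strip, and the sample inputs are made up for illustration. Since | has the lowest precedence in a regex, the ungrouped pattern reads as "\bend" OR "something\b", so each \b guards only one side of one alternative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names and inputs are illustrative only.
class GroupingDemo {
    // Without a group: parsed as (\bend) | (something\b)
    static final Pattern UNGROUPED = Pattern.compile("\\bend|something\\b");
    // With a non-capturing group: both \b anchors apply to every alternative
    static final Pattern GROUPED = Pattern.compile("\\b(?:end|something)\\b");

    // Removes every match of pat from line
    static String strip(Pattern pat, String line) {
        return pat.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // Ungrouped: "end" has no trailing boundary check, so it is cut out of "endless"
        System.out.println(strip(UNGROUPED, "endless"));
        // Grouped: "endless" is left alone because "end" is not a whole word here
        System.out.println(strip(GROUPED, "endless"));
    }
}
```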

Or, better, compile the regex before entering the loop:

Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
    line = pat.matcher(line).replaceAll("");
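As a rough end-to-end sketch of the compile-once approach, the snippet below builds the pattern a single time and reuses it for each line. The class WordRemover, the methods buildPattern and strip, and the sample lines are assumptions made for this example. It also assumes the words in toDelete contain only word characters, so they need no escaping.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original question's code.
class WordRemover {
    // Builds \b(?:w1|w2|...)\b once.
    // Assumption: every word consists of word characters only (no regex metacharacters).
    static Pattern buildPattern(Set<String> toDelete) {
        return Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
    }

    // Removes every whole-word match from the line
    static String strip(Pattern pat, String line) {
        return pat.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("end", "something"));
        Pattern pat = buildPattern(toDelete); // compiled once, outside the loop

        List<String> lines = Arrays.asList("the end is near", "an endless something");
        for (String line : lines) {
            System.out.println(strip(pat, line));
        }
    }
}
```

Note that only the word itself is removed; the surrounding spaces stay, so "the end is near" keeps two spaces between "the" and "is".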

UPDATE:

To match whole “words” that may contain special characters, escape each word with Pattern.quote and replace \b with unambiguous word boundaries: the negative lookbehind (?<!\w) in place of the leading \b makes sure there is no word character before the match, and the negative lookahead (?!\w) in place of the trailing \b makes sure there is none after it.

In Java 8, you may use this code:

// Quote each word so that regex metacharacters in it are matched literally
Set<String> nToDel = toDelete.stream()
    .map(Pattern::quote)
    .collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";

The regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the characters between \Q and \E are parsed as literal characters.
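The following sketch puts the quoted words and the lookaround boundaries together, and shows one input where plain \b would not match. The class QuotedWordRemover, its method names, and the sample strings are hypothetical. It uses Collectors.joining rather than an intermediate set, which is a stylistic choice for this example and not a requirement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; names and inputs are illustrative only.
class QuotedWordRemover {
    // Builds (?<!\w)(?:\Qw1\E|\Qw2\E|...)(?!\w)
    static Pattern buildPattern(List<String> toDelete) {
        String alternation = toDelete.stream()
            .map(Pattern::quote)               // escape metacharacters such as '+'
            .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    // Removes every match of pat from line
    static String strip(Pattern pat, String line) {
        return pat.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        Pattern pat = buildPattern(Arrays.asList("+end", "something-"));

        // Removed: no word character directly before '+' or after 'd'
        System.out.println(strip(pat, "a +end b"));
        // Kept: 'x' is a word character right before "+end"
        System.out.println(strip(pat, "x+end"));

        // For comparison, \b fails on "a +end b": there is no word boundary
        // between the space and '+', because both are non-word characters.
        Pattern withB = Pattern.compile("\\b" + Pattern.quote("+end") + "\\b");
        System.out.println(withB.matcher("a +end b").find());
    }
}
```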
