How to split a String on a separator without removing that separator in Java? [duplicate]

string1.split("(?=-)");

This works because split takes a regular expression, not a literal separator. The pattern (?=-) is a “zero-width positive lookahead”: it matches the empty position just before each dash without consuming the dash itself.

I would love to explain more but my daughter wants to play tea party. 🙂

Edit: Back!

To explain this, I will first show you a different split operation:

"Ram-sita-laxman".split("");

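This splits your string on every zero-length string. There is a zero-length string between every pair of adjacent characters. Therefore, the result (on Java 8 and later) is:

["R", "a", "m", "-", "s", "i", "t", "a", "-", "l", "a", "x", "m", "a", "n"]

On Java 7 and earlier, the zero-length match at the very start of the string also produces a leading empty string, so the array begins with "". The zero-length match at the very end never shows up, because split discards trailing empty strings.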
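If you want to check the empty-pattern split on your own JDK, here is a minimal sketch. The class and method names (EmptyPatternSplit, splitEverywhere) are made up for illustration, and the element count in the comment assumes Java 8 or later; an older runtime would return one extra, empty, leading element.

```java
import java.util.Arrays;

class EmptyPatternSplit {
    // Splits on the empty pattern, i.e. at every position between characters.
    static String[] splitEverywhere(String input) {
        return input.split("");
    }

    public static void main(String[] args) {
        String[] parts = splitEverywhere("Ram-sita-laxman");
        // Assuming Java 8+: one element per character, so 15 elements.
        System.out.println(parts.length);
        System.out.println(Arrays.toString(parts));
    }
}
```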

Now, I modify my regular expression ("") to only match zero-length strings if they are followed by a dash.

"Ram-sita-laxman".split("(?=-)");
["Ram", "-sita", "-laxman"]
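
To see the difference from an ordinary split side by side, here is a small runnable sketch. The class and method names (KeepDashDemo, splitBeforeDash) are just placeholders I picked for the example.

```java
import java.util.Arrays;

class KeepDashDemo {
    // Splits at the empty position before every dash, so each dash
    // stays at the start of the piece that follows it.
    static String[] splitBeforeDash(String input) {
        return input.split("(?=-)");
    }

    public static void main(String[] args) {
        String input = "Ram-sita-laxman";
        // Plain split: the dash itself is the match, so it is consumed.
        System.out.println(Arrays.toString(input.split("-")));       // [Ram, sita, laxman]
        // Lookahead split: the match is zero-width, so nothing is consumed.
        System.out.println(Arrays.toString(splitBeforeDash(input))); // [Ram, -sita, -laxman]
    }
}
```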

In that example, the ?= means “lookahead”. More specifically, it means “positive lookahead”. Why “positive”? Because there is also a negative lookahead (?!), which splits on every zero-length string that is not followed by a dash:

"Ram-sita-laxman".split("(?!-)");
["R", "a", "m-", "s", "i", "t", "a-", "l", "a", "x", "m", "a", "n"]

(Java 7 and earlier also return an empty string as the first element of this array.)

You can also have positive lookbehind (?<=) which will split on every zero-length string that is preceded by a dash:

"Ram-sita-laxman".split("(?<=-)");
["Ram-", "sita-", "laxman"]

Finally, you can also have negative lookbehind (?<!), which will split on every zero-length string that is not preceded by a dash:

"Ram-sita-laxman".split("(?<!-)");
["R", "a", "m", "-s", "i", "t", "a", "-l", "a", "x", "m", "a", "n"]

(Java 7 and earlier also return an empty string as the first element of this array.)

These four expressions are collectively known as the lookaround expressions.
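
The two positive lookarounds are the ones that answer the original question, so here is a rough sketch that wraps them for an arbitrary separator. The helper names (splitBefore, splitAfter) are my own invention, not anything from the JDK. I am also assuming you want the separator treated as literal text, which is why it goes through Pattern.quote before being placed inside the lookaround.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class KeepSeparatorSplit {
    // Positive lookahead: the separator stays at the start of the following piece.
    static String[] splitBefore(String input, String separator) {
        return input.split("(?=" + Pattern.quote(separator) + ")");
    }

    // Positive lookbehind: the separator stays at the end of the preceding piece.
    static String[] splitAfter(String input, String separator) {
        return input.split("(?<=" + Pattern.quote(separator) + ")");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBefore("Ram-sita-laxman", "-"))); // [Ram, -sita, -laxman]
        System.out.println(Arrays.toString(splitAfter("Ram-sita-laxman", "-")));  // [Ram-, sita-, laxman]
        // "." is a regex metacharacter; Pattern.quote makes it match literally.
        System.out.println(Arrays.toString(splitAfter("a.b.c", ".")));            // [a., b., c]
    }
}
```

Without the quoting, a separator such as "." or "|" would be read as regex syntax and the split would land in the wrong places.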

Bonus: Putting them together

I just wanted to show an example I encountered recently that combines two of the lookaround expressions. Suppose you wish to split a CapitalCase (also called PascalCase) identifier into its tokens:

"MyAwesomeClass" => ["My", "Awesome", "Class"]

You can accomplish this using this regular expression:

"MyAwesomeClass".split("(?<=[a-z])(?=[A-Z])");

This splits on every zero-length string that is preceded by a lower case letter ((?<=[a-z])) and followed by an upper case letter ((?=[A-Z])).

This technique also works with camelCase identifiers.
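
Here is a quick sketch that runs the same pattern over both styles. The class and method names (IdentifierTokens, tokens) are hypothetical. Note that the character classes only cover ASCII letters, and that a run of capitals is not broken up, because the pattern needs a lower case letter immediately before the split point.

```java
import java.util.Arrays;

class IdentifierTokens {
    // Splits at each boundary between a lower case and an upper case ASCII letter.
    static String[] tokens(String identifier) {
        return identifier.split("(?<=[a-z])(?=[A-Z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("MyAwesomeClass")));  // [My, Awesome, Class]
        System.out.println(Arrays.toString(tokens("myAwesomeClass")));  // [my, Awesome, Class]
        // No lower case letter precedes the inner capitals, so the acronym stays attached.
        System.out.println(Arrays.toString(tokens("parseHTMLString"))); // [parse, HTMLString]
    }
}
```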
