What are regular expression Balancing Groups?

As far as I know, balancing groups are unique to .NET’s regex flavor.

Aside: Repeated Groups

First, you need to know that .NET is (again, as far as I know) the only regex flavor that lets you access multiple captures of a single capturing group (not in backreferences but after the match has completed).

To illustrate this with an example, consider the pattern

(.)+

and the string "abcd".

in all other regex flavors, capturing group 1 will simply yield one result: d (note, the full match will of course be abcd as expected). This is because every new use of the capturing group overwrites the previous capture.

.NET on the other hand remembers them all. And it does so in a stack. After matching the above regex like

Match m = new Regex(@"(.)+").Match("abcd");

you will find that

m.Groups[1].Captures

Is a CaptureCollection whose elements correspond to the four captures

0: "a"
1: "b"
2: "c"
3: "d"

where the number is the index into the CaptureCollection. So basically every time the group is used again, a new capture is pushed onto the stack.

It gets more interesting if we are using named capturing groups. Because .NET allows repeated use of the same name we could write a regex like

(?<word>\w+)\W+(?<word>\w+)

to capture two words into the same group. Again, every time a group with a certain name is encountered, a capture is pushed onto its stack. So applying this regex to the input "foo bar" and inspecting

m.Groups["word"].Captures

we find two captures

0: "foo"
1: "bar"

This allows us to even push things onto a single stack from different parts of the expression. But still, this is just .NET’s feature of being able to track multiple captures which are listed in this CaptureCollection. But I said, this collection is a stack. So can we pop things from it?

Enter: Balancing Groups

It turns out we can. If we use a group like (?<-word>...), then the last capture is popped from the stack word if the subexpression ... matches. So if we change our previous expression to

(?<word>\w+)\W+(?<-word>\w+)

Then the second group will pop the first group’s capture, and we will receive an empty CaptureCollection in the end. Of course, this example is pretty useless.

But there’s one more detail to the minus-syntax: if the stack is already empty, the group fails (regardless of its subpattern). We can leverage this behavior to count nesting levels – and this is where the name balancing group comes from (and where it gets interesting). Say we want to match strings that are correctly parenthesized. We push each opening parenthesis on the stack, and pop one capture for each closing parenthesis. If we encounter one closing parenthesis too many, it will try to pop an empty stack and cause the pattern to fail:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$

So we have three alternatives in a repetition. The first alternative consumes everything that is not a parenthesis. The second alternative matches (s while pushing them onto the stack. The third alternative matches )s while popping elements from the stack (if possible!).

_{Note: Just to clarify, we’re only checking that there are no unmatched parentheses! This means that string containing no parentheses at all will match, because they are still syntactically valid (in some syntax where you need your parentheses to match). If you want to ensure at least one set of parentheses, simply add a lookahead (?=.*[(]) right after the ^.}

This pattern is not perfect (or entirely correct) though.

Finale: Conditional Patterns

There is one more catch: this does not ensure that the stack is empty at the end of the string (hence (foo(bar) would be valid). .NET (and many other flavors) have one more construct that helps us out here: conditional patterns. The general syntax is

(?(condition)truePattern|falsePattern)

where the falsePattern is optional – if it is omitted the false-case will always match. The condition can either be a pattern, or the name of a capturing group. I’ll focus on the latter case here. If it’s the name of a capturing group, then truePattern is used if and only if the capture stack for that particular group is not empty. That is, a conditional pattern like (?(name)yes|no) reads “if name has matched and captured something (that is still on the stack), use pattern yes otherwise use pattern no“.

So at the end of our above pattern we could add something like (?(Open)failPattern) which causes the entire pattern to fail, if the Open-stack is not empty. The simplest thing to make the pattern unconditionally fail is (?!) (an empty negative lookahead). So we have our final pattern:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$

Note that this conditional syntax has nothing per se to do with balancing groups but it’s necessary to harness their full power.

From here, the sky is the limit. Many very sophisticated uses are possible and there are some gotchas when used in combination with other .NET-Regex features like variable-length lookbehinds (which I had to learn the hard way myself). The main question however is always: is your code still maintainable when using these features? You need to document it really well, and be sure that everyone who works on it is also aware of these features. Otherwise you might be better off, just walking the string manually character-by-character and counting nesting levels in an integer.

Addendum: What’s with the `(?<A-B>...)` syntax?

Credits for this part go to Kobi (see his answer below for more details).

Now with all of the above, we can validate that a string is correctly parenthesized. But it would be a lot more useful, if we could actually get (nested) captures for all those parentheses’ contents. Of course, we could remember opening and closing parentheses in a separate capture stack that is not emptied, and then do some substring extraction based on their positions in a separate step.

But .NET provides one more convenience feature here: if we use (?<A-B>subPattern), not only is a capture popped from stack B, but also everything between that popped capture of B and this current group is pushed onto stack A. So if we use a group like this for the closing parentheses, while popping nesting levels from our stack, we can also push the pair’s content onto another stack:

^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$

_{Kobi provided this Live-Demo in his answer}

So taking all of these things together we can:

Remember arbitrarily many captures
Validate nested structures
Capture each nesting level

All in a single regular expression. If that’s not exciting… 😉

Some resources that I found helpful when I first learned about them:

http://blog.stevenlevithan.com/archives/balancing-groups
MSDN on balancing groups
MSDN on conditional patterns
http://kobikobi.wordpress.com/tag/balancing-group/ (slightly academic, but has some interesting applications)

Aside: Repeated Groups

Enter: Balancing Groups

Finale: Conditional Patterns

Addendum: What’s with the (?<A-B>...) syntax?

More Related Contents:

Leave a Comment Cancel reply

Addendum: What’s with the `(?<A-B>...)` syntax?