Collapse and Capture a Repeating Pattern in a Single Regex Expression

Read this first!

This post shows what is possible; it does not endorse the “everything regex” approach to problems. The author wrote 3-4 variations, each with subtle bugs that were tricky to detect, before reaching the current solution.

For your specific example, there are better and more maintainable solutions, such as matching the whole block and then splitting the match along the delimiters.
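A minimal Java sketch of that simpler route: match the whole block once, then split the captured body on the delimiter. The pattern start:(\w+(?:-\w+){2,9}):end is only an assumption about what the original regex looks like, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names and the WHOLE pattern are assumptions.
class MatchThenSplit {
    // Assumed shape of the original regex, with the repeated part captured as one group.
    private static final Pattern WHOLE =
            Pattern.compile("start:(\\w+(?:-\\w+){2,9}):end");

    static List<List<String>> extract(String input) {
        List<List<String>> blocks = new ArrayList<>();
        Matcher m = WHOLE.matcher(input);
        while (m.find()) {
            // group(1) is e.g. "a-b-c"; splitting on the delimiter yields the words.
            blocks.add(Arrays.asList(m.group(1).split("-")));
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(extract("start:a-b-c:end")); // prints [[a, b, c]]
    }
}
```

The regex only has to describe the block as a whole here, so there is no \G bookkeeping to get wrong.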

This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind it is reusable for similar cases.

Summary

  • .NET supports capturing every repetition of a group through the CaptureCollection class.
  • For languages that support \G and look-behind, we may be able to construct a regex that works with a global matching function. It is hard to get such a regex completely correct, and easy to write one that is subtly buggy.
  • For languages without \G and look-behind support, it is possible to emulate \G with ^ by chomping the input string after each match (not covered in this answer).

Solution

This solution assumes the regex engine supports the \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). The Java, Perl, PCRE, .NET, and Ruby regex flavors support all of these features.

However, in .NET you can keep your original regex, since .NET exposes every instance matched by a repeated capturing group through the CaptureCollection class.

For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:

(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)

The construction is \w+- repeated, then \w+:end.
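As a quick way to try this construction, here is a small Java sketch that runs it through a find() loop and collects group 1. The wrapper class, the method name, and the sample inputs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative wrapper; the class and method names are assumptions.
class FirstConstruction {
    // The first construction, escaped for a Java string literal.
    private static final Pattern WORD = Pattern.compile(
            "(?:start:(?=\\w+(?:-\\w+){2,9}:end)|(?<=-)\\G)(\\w+)(?:-|:end)");

    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        // Each successful find() consumes one word plus the delimiter (or :end) after it.
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("start:a-b-c:end")); // prints [a, b, c]
        System.out.println(words("start:a-b:end"));   // prints [] (too few words for {2,9})
    }
}
```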

(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)

The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). It is easier to reason about the correctness of this construction, since there are fewer alternations.
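The same kind of Java sketch for this second construction. Again, the class name, method name, and inputs are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative wrapper; the class and method names are assumptions.
class SecondConstruction {
    // The second construction, escaped for a Java string literal.
    private static final Pattern WORD = Pattern.compile(
            "(?:start:(?=\\w+(?:-\\w+){2,9}:end)|(?!^)\\G-)(\\w+)");

    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        // The first match consumes "start:" plus a word; later matches consume "-" plus a word.
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("start:one-two-three-four:end")); // prints [one, two, three, four]
    }
}
```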

The \G match boundary is especially useful for tokenization, where you need to make sure the engine does not skip ahead and match text that should have been rejected as invalid.
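To make the tokenization point concrete, here is a small Java sketch with a made-up token grammar (digits and the operators + and *). The grammar, class name, and method names are assumptions, not part of the original problem. With \G the scan stops at the first invalid character; without it the engine silently skips over that character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for a toy grammar: integers, '+' and '*'.
class StrictTokenizer {
    private static final Pattern ANCHORED = Pattern.compile("\\G(?:\\d+|[+*])");
    private static final Pattern UNANCHORED = Pattern.compile("\\d+|[+*]");

    private static List<String> tokens(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Every token must begin exactly where the previous one ended.
    static List<String> strict(String input) {
        return tokens(ANCHORED, input);
    }

    // Without \G, find() is free to skip characters it cannot match.
    static List<String> lenient(String input) {
        return tokens(UNANCHORED, input);
    }

    public static void main(String[] args) {
        System.out.println(strict("12+3?4"));  // prints [12, +, 3]
        System.out.println(lenient("12+3?4")); // prints [12, +, 3, 4]
    }
}
```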

Explanation

Let us break down the regex:

(?:
  start:(?=\w+(?:-\w+){2,9}:end)
    |
  (?<=-)\G
)
(\w+)
(?:-|:end)

The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.

The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.

I allow the regex to start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, any number of times, and the regex will simply match all the words. You then only need to process the returned array to work out which block each matched token came from.

The beginning of the regex checks for the presence of the string start:, and the look-ahead that follows checks that the number of words is within the specified limit and that the block ends with :end. Otherwise, we check that the character just before the current position is a - and continue from where the previous match ended.
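One possible way to separate the matches by block, sketched in Java under the assumption that a new block begins whenever the start: branch was taken. The sketch detects this by checking whether the match begins before group 1 does. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; the names are assumptions.
class GroupByBlock {
    private static final Pattern WORD = Pattern.compile(
            "(?:start:(?=\\w+(?:-\\w+){2,9}:end)|(?<=-)\\G)(\\w+)(?:-|:end)");

    static List<List<String>> blocks(String input) {
        List<List<String>> blocks = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // The \G branch consumes nothing before the word, so a match that
            // begins before group 1 must have gone through the start: branch.
            if (m.start(1) > m.start()) {
                blocks.add(new ArrayList<>());
            }
            blocks.get(blocks.size() - 1).add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // prints [[a, b, c], [d, e, f]]
        System.out.println(blocks("x start:a-b-c:end y start:d-e-f:end"));
    }
}
```

Comparing offsets is used instead of testing whether the matched text begins with start:, because a final word that happens to be "start" would produce the match text start:end and fool a string check.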

For the other construction:

(?:
  start:(?=\w+(?:-\w+){2,9}:end)
    |
  (?!^)\G-
)
(\w+)

Everything is almost the same, except that we match start:\w+ first and then the repetitions of the form -\w+. In the first construction, by contrast, we match start:\w+- first and then the repeated instances of \w+- (or \w+:end for the last repetition).

It is quite tricky to make this regex work when matching in the middle of the string:

  • We need to check the number of words between start: and :end (as part of the requirement of the original regex).

  • \G also matches at the beginning of the string! (?!^) is needed to prevent this behavior. Without it, the regex may produce a match when there isn’t any start:.

    For the first construction, the look-behind (?<=-) already prevents this case ((?!^) is implied by (?<=-)).

  • For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don’t match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.

    The second construction doesn’t run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
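The last two pitfalls can be reproduced with deliberately broken variants. In the Java sketch below, the *_NO_GUARD and *_NO_LOOKBEHIND patterns are constructed only to show the failure, and the inputs "-foo" and "start:a-b-c:endfoo-bar:end" are made-up test strings. All names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the broken variants exist only to show the pitfalls.
class BoundaryPitfalls {
    static final String PREFIX = "start:(?=\\w+(?:-\\w+){2,9}:end)";

    static final String SECOND = "(?:" + PREFIX + "|(?!^)\\G-)(\\w+)";
    // Broken: \G may match at the very start of the input.
    static final String SECOND_NO_GUARD = "(?:" + PREFIX + "|\\G-)(\\w+)";

    static final String FIRST = "(?:" + PREFIX + "|(?<=-)\\G)(\\w+)(?:-|:end)";
    // Broken: nothing stops \G from continuing right after :end.
    static final String FIRST_NO_LOOKBEHIND = "(?:" + PREFIX + "|(?!^)\\G)(\\w+)(?:-|:end)";

    static List<String> words(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words(SECOND_NO_GUARD, "-foo")); // prints [foo], although there is no start:
        System.out.println(words(SECOND, "-foo"));          // prints []

        String garbage = "start:a-b-c:endfoo-bar:end";
        System.out.println(words(FIRST_NO_LOOKBEHIND, garbage)); // prints [a, b, c, foo, bar]
        System.out.println(words(FIRST, garbage));               // prints [a, b, c]
    }
}
```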

Validation Version

If you want to validate that the input string follows the format (nothing extra before or after it) and extract the data, you can add anchors like this:

(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)

(The look-behind is no longer needed, but we still need (?!^) to prevent \G from matching at the start of the string.)
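A short Java sketch of the anchored variant (the one ending in (\w+)), showing that extra text before or after the block now yields no matches at all. The class name, method name, and sample inputs are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative wrapper; the class and method names are assumptions.
class ValidatingExtractor {
    private static final Pattern WORD = Pattern.compile(
            "(?:^start:(?=\\w+(?:-\\w+){2,9}:end$)|(?!^)\\G-)(\\w+)");

    // Returns the words if the whole input has the expected shape, else an empty list.
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("start:a-b-c:end"));  // prints [a, b, c]
        System.out.println(words("xstart:a-b-c:end")); // prints []
        System.out.println(words("start:a-b-c:end!")); // prints []
    }
}
```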

Construction

I don’t think there is a general way to modify a regex so that it captures all instances of a repetition. One example of a “hard” (or impossible?) case to convert is when a repetition has to backtrack one or more iterations to fulfill some condition in order to match.

When the original regex describes the whole input string (validation type), it is usually easier to convert than a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex first, which converts a matching-type problem back into a validation-type problem.

We build such regex by going through these steps:

  • Write a regex that covers the part before the repetition (e.g. start:). Let us call this the prefix regex.
  • Match and capture the first instance. (e.g. (\w+))
    (At this point, the first instance and delimiter should have been matched)
  • Add \G as an alternation. You usually also need to prevent it from matching at the start of the string.
  • Add the delimiter (if any). (e.g. -)
    (After this step, the rest of the tokens should have also been matched, except maybe the last)
  • Add the part that covers what comes after the repetition, if necessary (e.g. :end). Let us call this the suffix regex (whether we add it to the construction doesn’t matter).
  • Now the hard part. You need to check that:
    • There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
    • There is no way to start any match after the suffix regex has been matched. Take note of how the \G branch starts a match.
    • For the first construction, if you mix the suffix regex (e.g. :end) with the delimiter (e.g. -) in an alternation, make sure you don’t end up allowing the suffix regex to act as a delimiter.
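The mechanical part of these steps can be sketched as a small Java helper that assembles the second construction from its pieces. The helper, its parameter names, and the bracketed-number example are hypothetical. It does not perform the checks in the last step, which still have to be done by hand for each concrete regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical builder for the second construction; all names are assumptions.
class RepeatCaptureBuilder {
    // prefix:    regex for the part before the repetition (e.g. start:)
    // lookahead: regex the whole remainder must satisfy (word count, suffix)
    // item:      regex for one repeated instance (e.g. \w+)
    // delimiter: regex between instances (e.g. -)
    static String build(String prefix, String lookahead, String item, String delimiter) {
        return "(?:" + prefix + "(?=" + lookahead + ")|(?!^)\\G" + delimiter + ")(" + item + ")";
    }

    static List<String> items(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Reassembles the second construction from its pieces.
        String words = build("start:", "\\w+(?:-\\w+){2,9}:end", "\\w+", "-");
        System.out.println(items(words, "start:a-b-c:end")); // prints [a, b, c]

        // A made-up second case: comma-separated integers in square brackets.
        String numbers = build("\\[", "\\d+(?:,\\d+)*\\]", "\\d+", ",");
        System.out.println(items(numbers, "[1,2,3]")); // prints [1, 2, 3]
    }
}
```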
