Capturing text before and after a C-style code block with a Perl regular expression

Capture groups are numbered left to right by the position of their opening parenthesis in the regex, not in the order in which they match. Here is a simplified view of your regex:

m/
  (.+?)  # group 1
  (?:  # the $code_block regex
    (?&block)
    (?(DEFINE)
      (?<block> ... )  # group 2
    )
  )
  (.+)  # group 3
/xs

Named groups are numbered too, so they take up a slot in this numbering and can also be accessed by number.

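The second group is the block group. However, this group is only invoked as a named subpattern through (?&block), never matched as a capture in its own right, and captures set inside a recursion are not visible once the recursion returns. As such, the $2 capture value is undef.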

As a consequence, the text after the code block is stored in $3, not in $2.
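To see the numbering in action, here is a minimal sketch. The $code_block below is a hypothetical stand-in for yours (it just matches a balanced { ... } block), and the sample text and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real $code_block: a balanced { ... } block.
# The named group <block> is the second opening capture paren once this
# is interpolated into the outer regex, so it becomes group 2.
my $code_block = qr/
    (?&block)
    (?(DEFINE)
        (?<block> \{ (?: [^{}]++ | (?&block) )*+ \} )
    )
/x;

my $text = "int main() { if (x) { y(); } } // done";

# In list context the match returns ($1, $2, $3).
my ($before, $block_slot, $after) = $text =~ m/(.+?)$code_block(.+)/s;

print "1: [$before]\n";
print "2: ", (defined $block_slot ? "[$block_slot]" : "undef"), "\n";
print "3: [$after]\n";
```

The second slot should come back undefined, with the trailing text in the third.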

There are two ways to deal with this problem:

  • For complex regexes, use only named captures. Consider a regex complex as soon as you assemble it from regex objects, or if its captures are conditional. Here:

    if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
        print $+{before};
        print $+{afterwards};
    }
    
  • Put all your (?(DEFINE)...) blocks at the end of the regex, where they can’t mess up your capture numbering. For example, your $code_block regex would only define a named pattern, which you then invoke explicitly with (?&block).
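The second option could look roughly like this. It is only a sketch: $define_block is a hypothetical name, and the block pattern is a simplified stand-in that matches balanced braces:

```perl
use strict;
use warnings;

# Hypothetical: this regex object only *defines* <block>; it never invokes it.
my $define_block = qr/
    (?(DEFINE)
        (?<block> \{ (?: [^{}]++ | (?&block) )*+ \} )
    )
/x;

my $text = "int main() { if (x) { y(); } } // done";

# (?&block) is invoked explicitly; the definition is interpolated last,
# so <block> becomes group 3 and the real captures stay at $1 and $2.
my ($before, $after) = $text =~ m/(.+?)(?&block)(.+)$define_block/s;

print "before: [$before]\n";
print "after:  [$after]\n";
```

Because the definition comes last, adding more named subpatterns to it later does not shift the numbers of the captures you actually use.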
