Is there something like a counter variable in regular expression replace?

FMTEYEWTK about Fancy Regexes

Ok, I’m going to go from the simple to the sublime. Enjoy!

Simple s///e Solution

Given this:

#!/usr/bin/perl

$_ = <<"End_of_G&S";
    This particularly rapid,
        unintelligible patter
    isn't generally heard,
        and if it is it doesn't matter!
End_of_G&S

my $count = 0;

Then this:

s{
    \b ( [\w']+ ) \b
}{
    sprintf "(%s)[%d]", $1, ++$count;
}gsex;

produces this

(This)[1] (particularly)[2] (rapid)[3],
    (unintelligible)[4] (patter)[5]
(isn't)[6] (generally)[7] (heard)[8], 
    (and)[9] (if)[10] (it)[11] (is)[12] (it)[13] (doesn't)[14] (matter)[15]!

Interpolated Code in Anon Array Solution

Whereas this:

s/\b([\w']+)\b/#@{[++$count]}=$1/g;

produces this:

#1=This #2=particularly #3=rapid,
    #4=unintelligible #5=patter
#6=isn't #7=generally #8=heard, 
    #9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!

Solution with code in LHS instead of RHS

This puts the incrementation within the match itself:

s/ \b ( [\w']+ ) \b (?{ $count++ }) /#$count=$1/gx;

yields this:

#1=This #2=particularly #3=rapid,
    #4=unintelligible #5=patter
#6=isn't #7=generally #8=heard, 
    #9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!

A Stuttering Stuttering Solution Solution Solution

This

s{ \b ( [\w'] + ) \b             }
 { join " " => ($1) x ++$count   }gsex;

generates this delightful answer:

This particularly particularly rapid rapid rapid,
    unintelligible unintelligible unintelligible unintelligible patter patter patter patter patter
isn't isn't isn't isn't isn't isn't generally generally generally generally generally generally generally heard heard heard heard heard heard heard heard, 
    and and and and and and and and and if if if if if if if if if if it it it it it it it it it it it is is is is is is is is is is is is it it it it it it it it it it it it it doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't matter matter matter matter matter matter matter matter matter matter matter matter matter matter matter!

Exploring Boundaries

There are more robust approaches to word boundaries that work for plural possessives (the previous approaches don’t), but I suspect your mystery lies in getting the ++$count to fire, not with the subtleties of \b behavior.

I really wish people understood that \b isn’t what they think it is.
They always think it means there’s white space or the edge of the string
there. They never think of it as \w\W or \W\w transitions.

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

As you see, it’s conditional depending on what it’s touching. That’s what the (?(COND)THEN|ELSE) clause is for.

This becomes an issue with things like:

$_ = qq('Tis Paul's parents' summer-house, isn't it?\n);
my $count = 0;

s{
    (?(?=[\-\w']) (?<![\-\w'])  | (?<![^\-\w']) )
    ( [\-\w'] + )
    (?(?<=[\-\w']) (?![\-\w'])  | (?![^\-\w'])  )
}{
    sprintf "(%s)[%d]", $1, ++$count
}gsex;

print;

which correctly prints

('Tis)[1] (Paul's)[2] (parents')[3] (summer-house)[4], (isn't)[5] (it)[6]?

Worrying about Unicode

1960s-style ASCII is about 50 years out of date. Just as whenever you see anyone write [a-z], it’s nearly always wrong, it turns out that things like dashes and quotation marks shouldn’t show up as literals in patterns, either. While we’re at it, you probably don’t want to use \w, because that includes numbers and underscores as well, not just alphabetics.

Imagine this string:

$_ = qq(\x{2019}Tis Ren\x{E9}e\x{2019}s great\x{2010}grandparents\x{2019} summer\x{2010}house, isn\x{2019}t it?\n);

which you could have as a literal with use utf8:

use utf8;
$_ = qq(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?\n);

This time I’ll go at the pattern a bit differently, separating out my definition of terms from their execution to try to make it more readable and thence maintainable:

#!/usr/bin/perl -l
use 5.10.0;
use utf8;
use open qw< :std :utf8 >;
use strict;
use warnings qw< FATAL all >;
use autodie;

$_ = q(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?);

my $count = 0;

s{ (?<WORD> (?&full_word)  )

   # the rest is just definition
   (?(DEFINE)

     (?<word_char>   [\p{Alphabetic}\p{Quotation_Mark}] )

     (?<full_word>

             # next line won't compile cause
             # fears variable-width lookbehind
             ####  (?<! (?&word_char) )   )
             # so must inline it

         (?<! [\p{Alphabetic}\p{Quotation_Mark}] )

         (?&word_char)
         (?:
             \p{Dash}
           | (?&word_char)
         ) *

         (?!  (?&word_char) )
     )

   )   # end DEFINE declaration block

}{
    sprintf "(%s)[%d]", $+{WORD}, ++$count;
}gsex;

print;

That code when run produces this:

(’Tis)[1] (Renée’s)[2] (great‐grandparents’)[3] (summer‐house)[4], (isn’t)[5] (it)[6]?

Ok, so that may have beeen FMTEYEWTK about fancy regexes, but aren’t you glad you asked? ☺

Leave a Comment