“vertical” regex matching in an ASCII “image”

Answer to question 1

To answer the first question one could use:

(?xm)                    # ignore comments and whitespace, ^ matches beginning of line
^                        # beginning of line
(?:
    .                    # any character except \n
    (?=                  # lookahead
        .*+\n            # go to next line
        ( \1?+ . )       # add a character to the 1st capturing group
        .*+\n            # next line
        ( \2?+ . )       # add a character to the 2nd capturing group
    )
)*?                      # repeat as few times as needed
X .*+\n                  # X on the first line and advance to next line
\1?+                     # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X .*+\n                  # X on the 2nd line and advance to next line
\2?+                     # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X                        # X on the 3rd line

Online demo

This expression works in Perl, PCRE, Java and should work in .NET.

The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to “count”).

\1?+ means if \1 matches (or is defined) consume it, and don’t give it back (don’t backtrack). In this case it’s equivalent to (?(1) \1 ). Which means match \1 if \1 is defined.

polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)

Answer to question 2

Plain matching

When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:

It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner’s .NET solution).

Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.

(Semi?)proof

Assume that no variable length lookbehinds are available.

You have to in some way count the number of characters on a line before an X.
Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.
If you start the match at the beginning of a line you can only get at most one match per line.

Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.

Length/indirect solution

On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).

This solution is based on/inspired by m.buettner’s nice “partial PCRE solution”.

One could simply replace all matches of the following expression with $3, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.

^
(?:
    (?:                   # match .+? characters
        .
        (?=               # counting the same number on the following two lines
            .*+\n
            ( \1?+ . )
            .*+\n
            ( \2?+ . )
        )
    )+?
    (?<= X )              # till the above consumes an X
    (?=                   # that matches the following conditions
        .*+\n
        \1?+
        (?<= X )
        .*+\n
        \2?+
        (?<= X )
    )
    (?=                   # count the number of matches
        .*+\n
        ( \3?+ . )        # the number of matches = length of $3
    )
)*                        # repeat as long as there are matches on this line
.*\n?                     # remove the rest of the line

Which in Perl could be written as:

$in =~ s/regex/$3/gmx;
$count = length $in;

Online demo

This expression is similar to the solution to question 1 above, with some modifications to include X in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.

Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.

Test cases

Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).

Test #0:
--------------------
X
X
X

result: 1 (X)


Test #1:
--------------------
..X....
..X....
..X....

result: 1 (.)


Test #2:
--------------------
..X.X..
..X.X..
....X..

result: 1 (.)


Test #3:
--------------------
..X....
..X....
...X...

result: 0 ()


Test #4:
--------------------
..X....
...X...
..X....

result: 0 ()


Test #5:
--------------------
....X..
.X..X..
.X.....

result: 0 ()


Test #6:
--------------------
.X..X..
.X.X...
.X.X...

result: 1 (.)


Test #7:
--------------------
.X..X..
.X..X..
.X..X..

result: 2 (.X)


Test #8:
--------------------
XXX
XXX
XXX

result: 3 (XXX)


Test #9:
--------------------
X.X.X
XXXXX
XXXXX
.X.X.

result: 5 (XXXXX)


Test #10:
--------------------
1....X.......
2..X..X...X....
3X.X...X..X.....
4X....XXXXXX.....
5X..XXX...........
6.....X..........
7.........X....X
8..X......X....X....
9..X......X....X....X...
A....X.....
B.X..X..
C.....
XXX
XXX
XXX
.

result: 8 (3458.XXX)

Leave a Comment