What is the Python way of doing a \G anchored parsing loop?

Emulate \G at beginning of a regex with re.RegexObject.match

You can emulate the effect of \G at the beginning of a regex with the re module by keeping track of the current position yourself and passing it as the pos argument to re.RegexObject.match (the compiled-pattern type is called re.Pattern in the current Python 3 documentation). The match is then forced to begin exactly at pos.

def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # The index != m.end() check rejects a zero-length match, but it is
    # mainly a guard against an accidental infinite loop.
    # Don't expect a regex that can match the empty string to work here.
    # See the Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o
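
Passing pos is not the same as slicing the string, and the difference matters here because the full pattern relies on lookbehind and ^. A minimal sketch of the behaviour, using throwaway patterns that are not part of the original code:

```python
import re

# Hypothetical patterns, used only to illustrate how pos behaves.
after_a = re.compile(r'(?<=a)b')
starts_with_b = re.compile(r'^b')
plain_b = re.compile(r'b')

text = 'ab'

# match() is anchored at pos: it does not scan forward the way search() does.
anchored_miss = plain_b.match(text, 0)      # None: 'b' is not at index 0
anchored_hit = plain_b.match(text, 1)       # matches 'b' at index 1
scanned_hit = plain_b.search(text, 0)       # search() scans and finds index 1

# With pos, a lookbehind can still see the characters before pos ...
lookbehind_with_pos = after_a.match(text, 1)       # matches
# ... but slicing throws them away.
lookbehind_with_slice = after_a.match(text[1:])    # None

# ^ still means the real start of the string, not pos.
caret_with_pos = starts_with_b.match(text, 1)      # None
caret_with_slice = starts_with_b.match(text[1:])   # matches
```

This is why the loop passes index to match instead of calling match on w[index:]: alternatives such as (?<=^1[sS][tT]) would see a different string otherwise.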

Caveat

A caveat to this method is that it doesn’t play well with a regex whose main match can be the empty string, since the match method offers no way to retry at the same position while forbidding a zero-length match.

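As an example, on Python versions before 3.7, re.findall(r'(.??)', 'abc') returns a list of 4 empty strings ['', '', '', ''], whereas in PCRE you can find 7 matches ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th, and 6th matches start at the same indices as the 1st, 3rd, and 5th matches respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents an empty match. Since Python 3.7, re.findall and re.finditer also allow a non-empty match to start right after an empty one, but the match method exposes no such option, so the manual loop is still affected.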

I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code couldn’t have worked.
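
To see how the caveat affects the manual loop, here is a small sketch. tokenize_with is a hypothetical variant of tokenize that takes the pattern as an argument, and (.??) is just the example pattern from above:

```python
import re

def tokenize_with(pattern, text):
    """Same loop as tokenize, but with the compiled pattern passed in."""
    index = 0
    tokens = []
    m = pattern.match(text, index)
    while m and index != m.end():
        tokens.append(m.group(1))
        index = m.end()
        m = pattern.match(text, index)
    return tokens

lazy_optional = re.compile(r'(.??)')   # the lazy quantifier prefers the empty match
first = lazy_optional.match('abc', 0)
print(first.span())                          # (0, 0): zero-length match at index 0
print(tokenize_with(lazy_optional, 'abc'))   # []: the guard stops the loop at once
```

The guard prevents an infinite loop, but nothing gets tokenized, because there is no way to ask match for a non-empty match at the same index.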

Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) to (.+?), as done in the question, avoids this issue because (.+?) must consume at least one character, though you might want to use the re.S flag so that . also matches newlines.
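
The effect of re.S can be sketched with a reduced pattern that keeps only the letter-digit boundary. The pattern and the input string are made up for illustration:

```python
import re

text = 'a\nb1'

# Hypothetical reduced pattern: a lazy token up to a letter-digit boundary.
without_s = re.compile(r'(.+?)(?<=[a-zA-Z])(?=[0-9])')
with_s = re.compile(r'(.+?)(?<=[a-zA-Z])(?=[0-9])', re.S)

miss = without_s.match(text, 0)   # None: '.' cannot cross the newline
hit = with_s.match(text, 0)       # the token spans the newline

print(miss)                # None
print(repr(hit.group(1)))  # 'a\nb'
```

Whether a token should be allowed to span a newline depends on your input, so treat re.S as an option to consider and not as a requirement.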

Other comments on the regex

Since the case-insensitive flag in Python affects the whole pattern (scoped inline flags such as (?i:st) are only accepted from Python 3.6 onward), the case-insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if rejecting mixed case is part of your requirement.
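
The two rewrites are not equivalent, which a quick check makes visible. The variable names are illustrative, and the last pattern assumes Python 3.6 or later:

```python
import re

per_letter = re.compile(r'1[sS][tT]')      # any mix of cases, like (?i:st)
two_spellings = re.compile(r'1(?:st|ST)')  # only all-lower or all-upper

for candidate in ('1st', '1ST', '1St'):
    print(candidate,
          bool(per_letter.fullmatch(candidate)),
          bool(two_spellings.fullmatch(candidate)))

# On Python 3.6 and later the scoped inline flag is accepted directly.
scoped = re.compile(r'1(?i:st)')
print(bool(scoped.fullmatch('1St')))
```

Only the mixed-case input 1St separates the two: [sS][tT] accepts it and (?:st|ST) rejects it.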

Since Python supports free-spacing mode through the re.X flag, you can lay out the regex much as you did in the Perl code:

import re

matcher = re.compile(r'''
    (.+?)
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character 
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not second
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not third
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)
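
To try the approach end to end without the ordinal rules, here is a self-contained sketch. The reduced pattern and the name tokenize_reduced are hypothetical; only three of the boundaries above are kept:

```python
import re

# Reduced, hypothetical subset of the boundaries above (no ordinal handling).
reduced = re.compile(r'''
    (.+?)
    (?:
        (?<=[a-z])(?=[A-Z])       # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9])    # letter followed by digit
    |   (?=$)                     # end of string
    )
''', re.X)

def tokenize_reduced(w):
    """Anchored loop: each match must start where the previous one ended."""
    index = 0
    tokens = []
    m = reduced.match(w, index)
    while m and index != m.end():
        tokens.append(m.group(1))
        index = m.end()
        m = reduced.match(w, index)
    return tokens

print(tokenize_reduced('fooBar42'))   # ['foo', 'Bar', '42']
```

Because (.+?) always consumes at least one character, every iteration advances and the loop ends when match returns None at the end of the string.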
