Confused about backslashes in regular expressions [duplicate]

The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn’t a recognized escape character, then the \ is treated like any other character and passed through, but I don’t recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.

If you want to see how Python is expanding your string escapes, just print out the string. For example:

s="a\\b\tc"
print(s)

If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.

Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.

An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".

Leave a Comment