import re
# Raw string; the ur"" prefix is Python 2 only and is a syntax error in Python 3.
regex = r"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?', except harder to read.
The first bracketed group [[1P] tells re that any one of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P], which matches either '/' or 'P' (the ] that follows it is then just a literal bracket). That’s not what you want at all. So,
- Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
- To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
- To return only the words inside the tags, place grouping parentheses around .+?.
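To see the effect of each step in isolation, here is a small sketch using the same sample line. The variable names (class_hits, whole_tags, names_only) are made up for illustration, and the first call uses a shortened pattern, [/P], rather than the full regex from the question, just to show how a character class behaves.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# Unescaped brackets form a character class: [/P] matches ONE character,
# either '/' or 'P', not the literal text "[/P]".
class_hits = re.findall(r"[/P]", "[/P]")

# Escaped brackets match the literal tags. With no parentheses in the
# pattern, findall returns each whole match, tags included.
whole_tags = re.findall(r"\[P\] .+? \[/P\]", line)

# Parentheses around .+? make findall return only the captured group.
names_only = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(class_hits)   # ['/', 'P']
print(whole_tags)   # ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
print(names_only)   # ['Barack Obama', 'Bill Gates']
```

The non-greedy .+? matters here: a greedy .+ would run from the first [P] to the last [/P] and return a single match spanning both names.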