This has nothing to do with splitting and punctuation; you just care about the letters (and numbers), and just want a regular expression:
import re

def getWords(text):
    return re.compile(r'\w+').findall(text)
Demo:
>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']
If you don’t care about numbers, replace \w with [A-Za-z] for just letters, or [A-Za-z'] to include contractions, etc. There are probably fancier regexes that match alphabetic but non-numeric characters beyond ASCII (e.g. letters with accents).
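As a rough sketch of those variants (the function names here are made up for illustration): in Python 3, the class [^\W\d_] means "a word character that is neither a digit nor an underscore", which works out to any Unicode letter, while [A-Za-z'] keeps apostrophes but only knows ASCII letters.

```python
import re

# Hypothetical names, for illustration only.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")       # Unicode letters; no digits, no underscore
WITH_APOSTROPHES = re.compile(r"[A-Za-z']+")  # ASCII letters plus the apostrophe

def getLetterWords(text):
    return LETTERS_ONLY.findall(text)

def getContractionWords(text):
    return WITH_APOSTROPHES.findall(text)

print(getLetterWords("Élise isn't here"))       # ['Élise', 'isn', 't', 'here']
print(getContractionWords("Élise isn't here"))  # ['lise', "isn't", 'here']
```

Neither is perfect: the first breaks the contraction apart, the second drops the accented letter. Which trade-off you want depends on your input.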
I almost answered this question here: Split Strings with Multiple Delimiters?
But your question is actually under-specified. Do you want 'this is: an example' to be split into
['this', 'is', 'an', 'example']
or
['this', 'is', 'an', '', 'example']?
I assumed it was the first case.
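To make the difference between the two readings concrete, here is a small sketch on an input string of my own (the function names are hypothetical). Collecting runs of word characters never produces empty strings, whereas splitting on every single delimiter character leaves an empty string wherever two delimiters are adjacent.

```python
import re

def wordsOnly(text):
    # First reading: collect maximal runs of word characters.
    return re.findall(r'\w+', text)

def splitOnEachDelimiter(text):
    # Second reading: every non-word character is its own split point.
    return re.split(r'\W', text)

print(wordsOnly('one, two'))             # ['one', 'two']
print(splitOnEachDelimiter('one, two'))  # ['one', '', 'two']
```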
['this', 'is', 'an', 'example'] is what I want. Is there a method without importing regex? If we can just replace the non-ascii_letters with '', then split the string into words in a list, would that work? – James Smith
The regexp is the most elegant, but yes, you could do this as follows:
def getWords(text):
    """
    Returns a list of words, where a word is defined as a
    maximally connected substring of alphanumeric characters,
    as defined by str.isalnum()

    >>> getWords('Hello world, my name is... Élise!')  # works in python3
    ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    return ''.join((c if c.isalnum() else ' ') for c in text).split()
Use c.isalpha() in place of c.isalnum() if you want letters only.
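If you specifically want the string.ascii_letters idea from the comment, a minimal sketch could look like the following (getAsciiWords is a name I made up). Note that the substitution has to be a space, not an empty string; removing the characters outright would glue neighbouring words together.

```python
from string import ascii_letters

def getAsciiWords(text):
    letters = set(ascii_letters)  # set lookup is faster than scanning the string
    # Replace everything that is not an ASCII letter with a space, then split on whitespace.
    return ''.join(c if c in letters else ' ' for c in text).split()

print(getAsciiWords('this is: an example'))  # ['this', 'is', 'an', 'example']
print(getAsciiWords('James the 2nd!'))       # ['James', 'the', 'nd']
```

Unlike the str.isalpha() version, this treats accented letters as separators.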
Sidenote: You could also do the following, though it requires importing from another standard-library module:
from itertools import groupby

# groupby is generally overkill and makes for unreadable code
# ... but is fun
def getWords(text):
    return [
        ''.join(chars)
        for isWord, chars in groupby(text, lambda c: c.isalnum())
        if isWord
    ]
If this is homework, they’re probably looking for an imperative thing like a two-state Finite State Machine where the state is “was the last character a letter” and if the state changes from letter -> non-letter then you output a word. Don’t do that; it’s not a good way to program (though sometimes the abstraction is useful).
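For comparison only, here is roughly what that two-state approach might look like. This is my own sketch (getWordsFsm is a hypothetical name), not something any assignment prescribes.

```python
def getWordsFsm(text):
    words = []
    current = []     # characters of the word currently being built
    inWord = False   # the state: was the last character a letter?
    for c in text:
        if c.isalpha():
            current.append(c)
            inWord = True
        elif inWord:
            # State change letter -> non-letter: output the finished word.
            words.append(''.join(current))
            current = []
            inWord = False
    if inWord:
        # The text ended in the middle of a word; flush it.
        words.append(''.join(current))
    return words

print(getWordsFsm('Hello world, my name is...James!'))
# ['Hello', 'world', 'my', 'name', 'is', 'James']
```

It gives the same result as the one-liners above, with a lot more bookkeeping.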