Extracting words from a string, removing punctuation and returning a list with separated words

This has nothing to do with splitting and punctuation; you just care about the letters (and numbers), and just want a regular expression:

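import re

def getWords(text):
    # \w+ matches each maximal run of letters, digits, and underscores;
    # the raw string keeps \w from being treated as a string escape
    return re.compile(r'\w+').findall(text)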

Demo:

>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']

If you don't care about numbers, replace \w with [A-Za-z] for just ASCII letters, or [A-Za-z'] to include contractions, etc. To also pick up letters outside ASCII (e.g. letters with accents), one option in Python 3 is the class [^\W\d_], which matches word characters other than decimal digits and the underscore.
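To make the difference between those character classes concrete, here is a small sketch. The constant names and the sample sentence are made up for illustration; only the patterns come from the paragraph above.

```python
import re

# Hypothetical names, for illustration only.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")      # ASCII letters
WITH_APOSTROPHES = re.compile(r"[A-Za-z']+") # ASCII letters plus apostrophes
NON_ASCII_TOO = re.compile(r"[^\W\d_]+")     # word chars minus decimal digits and '_'

# \u00c9 is the precomposed capital E with acute accent.
sample = "Don't stop, \u00c9lise: it's the 2nd try!"

print(LETTERS_ONLY.findall(sample))
# ['Don', 't', 'stop', 'lise', 'it', 's', 'the', 'nd', 'try']
print(WITH_APOSTROPHES.findall(sample))
# ["Don't", 'stop', 'lise', "it's", 'the', 'nd', 'try']
print(NON_ASCII_TOO.findall(sample))
# ['Don', 't', 'stop', 'Élise', 'it', 's', 'the', 'nd', 'try']
```

Note that the ASCII-only classes cut the accented letter out of the name, and that dropping digits turns '2nd' into 'nd', so pick the class that matches what you consider a word.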


I almost answered this question here: Split Strings with Multiple Delimiters?

But your question is actually under-specified: Do you want 'this is: an example' to be split into:

  • ['this', 'is', 'an', 'example']
  • or ['this', 'is', '', 'an', 'example']?

I assumed it was the first case.
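For reference, the two readings correspond roughly to splitting on every single delimiter versus matching runs of word characters. This is only a sketch; the delimiter set [ :] is an assumption chosen to fit the example string.

```python
import re

text = 'this is: an example'

# Splitting on each individual delimiter leaves an empty string
# wherever two delimiters sit next to each other (here ':' then ' ').
print(re.split(r'[ :]', text))   # ['this', 'is', '', 'an', 'example']

# Matching runs of word characters never yields empty strings.
print(re.findall(r'\w+', text))  # ['this', 'is', 'an', 'example']
```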


['this', 'is', 'an', 'example'] is what I want. Is there a method without importing regex? If we can just replace the non ascii_letters with '', then splitting the string into words in a list, would that work? – James Smith 2 mins ago

The regexp is the most elegant, but yes, you could do this as follows:

def getWords(text):
    """
        Returns a list of words, where a word is defined as a
        maximally connected substring of letters or digits,
        as defined by "a".isalnum()

        >>> getWords('Hello world, my name is... Élise!')  # works in python3
        ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    # Turn every non-alphanumeric character into a space, then split on whitespace
    return ''.join((c if c.isalnum() else ' ') for c in text).split()

or use .isalpha() in place of .isalnum() if digits should not count as part of a word.
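A quick sketch of how the two predicates differ in practice. The function name get_words and the keep_digits flag are hypothetical, added here only to switch between the two variants.

```python
def get_words(text, keep_digits=True):
    # Hypothetical helper: choose the predicate that decides what counts
    # as a word character, then replace everything else with a space.
    is_word_char = str.isalnum if keep_digits else str.isalpha
    return ''.join(c if is_word_char(c) else ' ' for c in text).split()

print(get_words('James the 2nd!'))                     # ['James', 'the', '2nd']
print(get_words('James the 2nd!', keep_digits=False))  # ['James', 'the', 'nd']
```

With .isalpha() the digit is treated like punctuation, so '2nd' loses its '2'.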


Sidenote: You could also do the following, though it requires importing another standard-library module:

from itertools import groupby

# groupby is generally overkill and makes for unreadable code
# ... but is fun

def getWords(text):
    # group consecutive characters by whether they are alphanumeric,
    # and keep only the alphanumeric runs
    return [
        ''.join(chars)
        for isWord, chars in groupby(text, key=lambda c: c.isalnum())
        if isWord
    ]
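If the groupby version looks opaque, it may help to see the intermediate runs it produces before the non-word ones are filtered out. This is just an illustrative sketch; the variable name runs and the sample string are made up.

```python
from itertools import groupby

text = 'My name, is test!'

# Each group is a run of consecutive characters sharing the same key value.
runs = [(is_word, ''.join(chars))
        for is_word, chars in groupby(text, key=str.isalnum)]
print(runs)
# [(True, 'My'), (False, ' '), (True, 'name'), (False, ', '),
#  (True, 'is'), (False, ' '), (True, 'test'), (False, '!')]
```

Keeping only the runs whose key is True gives the word list.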

If this is homework, they’re probably looking for an imperative thing like a two-state Finite State Machine where the state is “was the last character a letter” and if the state changes from letter -> non-letter then you output a word. Don’t do that; it’s not a good way to program (though sometimes the abstraction is useful).
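For completeness, that two-state approach might look something like the sketch below. The function name get_words_fsm is hypothetical, and it uses .isalnum() as the test for a word character to stay consistent with the versions above; swap in .isalpha() for letters only.

```python
def get_words_fsm(text):
    words = []
    current = []     # characters of the word currently being built
    in_word = False  # state: was the last character a word character?

    for c in text:
        if c.isalnum():
            current.append(c)
            in_word = True
        elif in_word:
            # Transition word -> non-word: output the finished word.
            words.append(''.join(current))
            current = []
            in_word = False

    # The text may end while still inside a word.
    if in_word:
        words.append(''.join(current))
    return words

print(get_words_fsm('Hello world, my name is...James the 2nd!'))
# ['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']
```

It gives the same result as the one-liners, at the cost of a lot more bookkeeping.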
