Split a text into single words

Use the class \p{P} which matches any unicode punctuation character, combined with the \s whitespace class.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);

This will split on a group of one or more whitespace characters, but also suck in any surrounding punctuation characters. It also matches punctuation characters at the beginning or end of the string. This discriminates cases such as “don’t” and “he said ‘ouch!'”

Leave a Comment