When is it best to use Regular Expressions over basic string splitting / substring'ing?

My main guideline is to use regular expressions for throwaway code and for user-input validation, or when I’m trying to find a specific pattern within a big blob of text. For most other purposes, I’ll write a grammar and implement a simple parser.

One important guideline (that’s really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language’s grammar is recursive.

For example, consider a tiny “expression language” for evaluating parenthesized arithmetic expressions. Examples of “programs” in this language would look like this:

1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3

A grammar is easy to write, and looks something like this:

DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/")
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"

With that grammar, you can build a recursive descent parser in a jiffy.
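
To make that concrete, here is a minimal sketch of such a parser in Java, with one method per grammar rule. The class name `ExprParser` and its helper methods are my own invention, and skipping whitespace between tokens is an assumption (the grammar above doesn't mention whitespace, but the sample programs use it). Note that the grammar as written is right-recursive and has no operator precedence, so this sketch evaluates `10 - 2 - 3` as `10 - (2 - 3)`; a real calculator would need a slightly richer grammar.

```java
/** Recursive descent evaluator: one method per rule of the grammar above. */
final class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) { this.src = src; }

    /** Parses and evaluates the whole input, rejecting trailing garbage. */
    static double evaluate(String text) {
        ExprParser p = new ExprParser(text);
        double value = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return value;
    }

    // EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
    private double expression() {
        skipSpaces();
        double left = (peek() == '(') ? group() : number();
        skipSpaces();
        char op = peek();
        if (op == '+' || op == '-' || op == '*' || op == '/') {
            pos++;
            double right = expression();   // the recursion a plain regex lacks
            switch (op) {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                default:  return left / right;
            }
        }
        return left;
    }

    // GROUP := "(" EXPRESSION ")"
    private double group() {
        expect('(');
        double value = expression();
        skipSpaces();
        expect(')');
        return value;
    }

    // NUMBER := (DIGIT)+
    private double number() {
        int start = pos;
        while (peek() >= '0' && peek() <= '9') pos++;
        if (start == pos) {
            throw new IllegalArgumentException("Expected a number at position " + pos);
        }
        return Double.parseDouble(src.substring(start, pos));
    }

    // Returns '\0' at end of input so callers need no bounds checks.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    // Assumption: whitespace between tokens is insignificant.
    private void skipSpaces() { while (Character.isWhitespace(peek())) pos++; }
}
```

With this sketch, `ExprParser.evaluate("5 * (10 - 6)")` returns 20.0, and malformed input such as `"(1 + 2"` throws an `IllegalArgumentException`.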

An equivalent regular expression is REALLY hard to write, because regular expressions don’t usually have very good support for recursion.
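
One way to see the problem: without recursion, a pattern can only handle a nesting depth that was fixed when the pattern was built. The sketch below (the class `BoundedDepthRegex` and its methods are hypothetical names, and allowing whitespace between tokens is my assumption) generates a `java.util.regex` pattern for a chosen maximum depth. Each extra level roughly doubles the size of the pattern text, and any input nested one level deeper is still rejected.

```java
import java.util.regex.Pattern;

/** Builds regexes that accept the expression language only up to a fixed nesting depth. */
final class BoundedDepthRegex {

    /** Returns regex source for expressions nested at most {@code depth} levels deep. */
    static String expression(int depth) {
        // At depth 0 an operand is just a number; otherwise it may also be
        // a parenthesized expression of the next-shallower depth.
        String atom = (depth == 0)
                ? "\\d+"
                : "(?:\\d+|\\(" + expression(depth - 1) + "\\))";
        // operand, then zero or more "operator operand" pairs
        return "\\s*" + atom + "\\s*(?:[-+*/]\\s*" + atom + "\\s*)*";
    }

    /** True if the whole text is an expression nested no deeper than {@code depth}. */
    static boolean matches(String text, int depth) {
        return Pattern.matches(expression(depth), text);
    }
}
```

Here `matches("5 * (10 - 6)", 1)` succeeds, but `matches("((1 + 1) / (2 + 2)) / 3", 1)` fails until the depth is raised to 2. The parser has no such limit.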

Another good example is JSON ingestion. I’ve seen people try to consume JSON with regular expressions, and it’s INSANE. JSON objects are recursive, so they’re just begging for context-free grammars and recursive descent parsers.

Hmmmmmmm… Looking at other people’s responses, I think I may have answered the wrong question.

I interpreted it as “when should you use a simple regex, rather than a full-blown parser?” whereas most people seem to have interpreted the question as “when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?”

Given that interpretation, my answer is: never.

Okay…. one more edit.

I’ll be a little more forgiving of the roll-your-own scheme. Just… don’t call it “parsing” :o)

I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:

if (str.equals("DooWahDiddy")) // No problemo.

if (str.contains("destroy the earth")) // Okay.

if (str.indexOf(";") < str.length() / 2) // Not bad.

Once your conditions contain multiple predicates, then you’ve started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.

if (str.startsWith("I") && str.endsWith("Widget") &&
    (!str.contains("Monkey") || !str.contains("Pox")))  // Madness.
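
For comparison, here is one way that last condition might be written as a single regular expression in Java. This is only a sketch: the class and method names (`WidgetCheck`, `byPredicates`, `byRegex`) are made up for illustration, and I'm assuming `.` should match line breaks so the pattern behaves like `contains` on multi-line input.

```java
import java.util.regex.Pattern;

/** Compares the multi-predicate check with one possible regex translation. */
final class WidgetCheck {

    // (?s)                          '.' also matches line terminators
    // (?!(?=.*Monkey)(?=.*Pox))     reject if BOTH "Monkey" and "Pox" occur anywhere
    // I.*Widget                     starts with "I" and ends with "Widget"
    private static final Pattern WIDGET =
            Pattern.compile("(?s)(?!(?=.*Monkey)(?=.*Pox))I.*Widget");

    /** The original hand-rolled condition. */
    static boolean byPredicates(String str) {
        return str.startsWith("I") && str.endsWith("Widget")
                && (!str.contains("Monkey") || !str.contains("Pox"));
    }

    /** The same rule as one pattern; matches() anchors it to the whole string. */
    static boolean byRegex(String str) {
        return WIDGET.matcher(str).matches();
    }
}
```

Both methods accept "I like Monkey Widget" and reject "I have Monkey Pox Widget". The regex isn't prettier at first glance, but the whole rule now lives in one declarative pattern that can be stored, logged, or changed without touching control flow.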

Regular expressions really aren’t that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).

Here’s a great reference:

http://www.regular-expressions.info/

PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.
