Tokenizing a String but ignoring delimiters within quotes

It’s much easier to use a java.util.regex.Matcher and call find() in a loop than to use any kind of split in this kind of scenario.

That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.

Here’s an example:

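    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "1 2 \"333 4\" 55 6    \"77\" 8 999";
    // 1 2 "333 4" 55 6    "77" 8 999

    // Either a quoted run (group 1) or a run of non-whitespace (group 2)
    String regex = "\"([^\"]*)\"|(\\S+)";

    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        // Only the alternate that matched has a non-null group
        if (m.group(1) != null) {
            System.out.println("Quoted [" + m.group(1) + "]");
        } else {
            System.out.println("Plain [" + m.group(2) + "]");
        }
    }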

The above prints (as seen on ideone.com):

    Plain [1]
    Plain [2]
    Quoted [333 4]
    Plain [55]
    Plain [6]
    Quoted [77]
    Plain [8]
    Plain [999]

The pattern is essentially:

    "([^"]*)"|(\S+)
     \_____/  \___/
        1       2

There are 2 alternates:

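The first alternate matches the opening double quote, a sequence of anything but a double quote (captured in group 1), then the closing double quote.
The second alternate matches any sequence of non-whitespace characters, captured in group 2.
The order of the alternates matters in this pattern: a double quote is itself a non-whitespace character, so if the \S+ alternate came first it would match "333 as a plain token before the quoted alternate was ever tried.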
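To see why the order of the alternates matters, here is a minimal sketch that runs the same input through both orderings. The class name AlternateOrderDemo and the helper matches() are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternateOrderDemo {
    // Hypothetical helper: collects every full match (group 0) of regex in text.
    static List<String> matches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "1 \"333 4\" 55";
        // 1 "333 4" 55

        // Quoted alternate first: the quoted segment stays in one piece.
        System.out.println(matches("\"([^\"]*)\"|(\\S+)", text));
        // prints [1, "333 4", 55]

        // \S+ first: it consumes the opening quote, so the segment is split.
        System.out.println(matches("(\\S+)|\"([^\"]*)\"", text));
        // prints [1, "333, 4", 55]
    }
}
```

With the reversed order, the quoted alternate never gets a chance to match, because \S+ succeeds at every position where a quote starts a token.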

Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
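As one possible way to extend the pattern, the sketch below assumes that a backslash escapes the next character inside a quoted segment (so \" is a literal quote). That escaping convention, the class name EscapedQuoteTokenizer, and the unescaping step are assumptions made for illustration, not part of the original pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokenizer {
    // Assumed convention: inside quotes, a backslash escapes the next character.
    // Unescaped regex: "((?:[^"\\]|\\.)*)"|(\S+)
    private static final Pattern TOKEN =
        Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted token: drop the escaping backslashes (\" becomes ", \\ becomes \).
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The text is: say "he said \"hi\" twice" done
        String text = "say \"he said \\\"hi\\\" twice\" done";
        for (String token : tokenize(text)) {
            System.out.println("[" + token + "]");
        }
        // prints:
        // [say]
        // [he said "hi" twice]
        // [done]
    }
}
```

Inside the quotes, each step consumes either one ordinary character or a backslash plus the character it escapes, so an escaped quote can never be mistaken for the closing one.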

References

regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus

Appendix

Note that StringTokenizer is a legacy class. It’s recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for the most flexibility.
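For input with no quoted segments, either of the simpler alternatives is enough. The sketch below shows both on whitespace-delimited text; the class name SimpleTokenizing and its two helper methods are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class SimpleTokenizing {
    // String.split with a whitespace-run delimiter.
    static List<String> withSplit(String text) {
        // trim() avoids an empty leading token when the text starts with whitespace.
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Scanner uses whitespace as its default delimiter.
    static List<String> withScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "1 2   55 999";
        System.out.println(withSplit(text));   // prints [1, 2, 55, 999]
        System.out.println(withScanner(text)); // prints [1, 2, 55, 999]
    }
}
```

Both approaches describe the delimiter rather than the token, which is why neither can keep a quoted segment such as "333 4" together.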
