How do I split a string by strings and include the delimiters using .NET?

Despite your reluctance to use regex it actually nicely preserves the delimiters by using a group along with the Regex.Split method:

string input = "123xx456yy789";
string pattern = "(xx|yy)";
string[] result = Regex.Split(input, pattern);

If you remove the parentheses from the pattern, using just "xx|yy", the delimiters are not preserved. Be sure to use Regex.Escape on the pattern if you use any metacharacters that hold special meaning in regex. The characters include \, *, +, ?, |, {, [, (,), ^, $,., #. For instance, a delimiter of . should be escaped \.. Given a list of delimiters, you need to “OR” them using the pipe | symbol and that too is a character that gets escaped. To properly build the pattern use the following code (thanks to @gabe for pointing this out):

var delimiters = new List<string> { ".", "xx", "yy" };
string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                  + ")";

The parentheses are concatenated rather than included in the pattern since they would be incorrectly escaped for your purposes.

EDIT: In addition, if the delimiters list happens to be empty, the final pattern would incorrectly be () and this would cause blank matches. To prevent this a check for the delimiters can be used. With all this in mind the snippet becomes:

string input = "123xx456yy789";
// to reach the else branch set delimiters to new List();
var delimiters = new List<string> { ".", "xx", "yy", "()" }; 
if (delimiters.Count > 0)
{
    string pattern = "("
                     + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                     + ")";
    string[] result = Regex.Split(input, pattern);
    foreach (string s in result)
    {
        Console.WriteLine(s);
    }
}
else
{
    // nothing to split
    Console.WriteLine(input);
}

If you need a case-insensitive match for the delimiters use the RegexOptions.IgnoreCase option: Regex.Split(input, pattern, RegexOptions.IgnoreCase)

EDIT #2: the solution so far matches split tokens that might be a substring of a larger string. If the split token should be matched completely, rather than part of a substring, such as a scenario where words in a sentence are used as the delimiters, then the word-boundary \b metacharacter should be added around the pattern.

For example, consider this sentence (yea, it’s corny): "Welcome to stackoverflow... where the stack never overflows!"

If the delimiters were { "stack", "flow" } the current solution would split “stackoverflow” and return 3 strings { "stack", "over", "flow" }. If you needed an exact match, then the only place this would split would be at the word “stack” later in the sentence and not “stackoverflow”.

To achieve an exact match behavior alter the pattern to include \b as in \b(delim1|delim2|delimN)\b:

string pattern = @"\b("
                + String.Join("|", delimiters.Select(d => Regex.Escape(d)))
                + @")\b";

Finally, if trimming the spaces before and after the delimiters is desired, add \s* around the pattern as in \s*(delim1|delim2|delimN)\s*. This can be combined with \b as follows:

string pattern = @"\s*\b("
                + String.Join("|", delimiters.Select(d => Regex.Escape(d)))
                + @")\b\s*";

Leave a Comment