How to sort an array of string alphabetically (case sensitive, nonstandard collation)

Keep alike words together…

For lists of words, it is often more useful to group the “same” words together (even though they differ in case). For example:

Keeping things together:          Simple "M after m":
------------------------          -------------------
mars                              mars
mars bar                          mars bar
Mars bar                          milk
milk                              milk-duds
Milk                              milky-way
milk-duds                         Mars bar
milky-way                         Milk
Milky-way                         Milky-way

If you want words arranged like the first column, I present three ways of doing so:

  • Using strcasecmp() combined with strcmp().
  • A single pass implementation tracking the character type with isalpha(), tolower(), and isupper().
  • A single pass implementation that utilizes a collating table.

In the end I discuss two alternatives:

  • Using the collating table to establish an arbitrary ordering.
  • Setting the locale to use locale based collating.

Using available library functions

If it is possible to do so, avoid reinventing the wheel. In this case, we can do so by using the POSIX function strcasecmp() to see if they are equal with a case-insensitive comparison, and falling back on strcmp() when they are.

int alphaBetize (const char *a, const char *b) {
    int r = strcasecmp(a, b);
    if (r) return r;
    /* if equal ignoring case, use opposite of strcmp() result to get
     * lower before upper */
    return -strcmp(a, b); /* aka: return strcmp(b, a); */
}

(On some systems, the case-insensitive comparison function is called stricmp() or _stricmp(). If one is not available to you, an implementation is provided below.)

#ifdef I_DONT_HAVE_STRCASECMP
int strcasecmp (const char *a, const char *b) {
    while (*a && *b) {
        if (tolower(*a) != tolower(*b)) {
            break;
        }
        ++a;
        ++b;
    }
    return tolower(*a) - tolower(*b);
}
#endif

Avoiding two passes over the strings

Sometimes, existing functions do not perform well enough, and you have to do something else to make things faster. The following function does the comparison in roughly the same way in a single pass, and without using either strcasecmp() or strcmp(). But, it treats all non-alphabetical characters as being less than letters.

int alphaBetize (const char *a, const char *b) {
    int weight = 0;
    do {
        if (*a != *b) {
            if (!(isalpha(*a) && isalpha(*b))) {
                if (isalpha(*a) || isalpha(*b)) {
                    return isalpha(*a) - isalpha(*b);
                }
                return *a - *b;
            }
            if (tolower(*a) != tolower(*b)) {
                return tolower(*a) - tolower(*b);
            }
            /* treat as equal, but mark the weight if not set */
            if (weight == 0) {
                weight = isupper(*a) - isupper(*b);
            }
        }
        ++a;
        ++b;
    } while (*a && *b);
    /* if the words compared equal, use the weight as tie breaker */
    if (*a == *b) {
        return weight;
    }
    return !*b - !*a;
}

Using this comparison for sorting will keep milk and Milk next to each other even if the list includes milk-duds.

Using a collating table

Here is a way to dynamically create a collating table from a “configuration”. It serves to illustrate a contrastive technique to change how strings get compared.

You can map how the letters of the alphabet are compared with a kind of simple table that describes the relative order you want letters (or any character except the NUL byte) to have:

const char * alphaBetical =
    "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";

From this ordering, we can create a look-up table to see how two letters are supposed to compare to each other. The following function initializes the table if it has not already been done first, and otherwise performs the table look-up.

int alphaBeta_lookup (int c) {
    static int initialized;
    static char table[CHAR_MAX+1];
    if (!initialized) {
        /* leave all non-alphaBeticals in their relative order, but below
           alphaBeticals */
        int i, j;
        for (i = j = 1; i < CHAR_MAX+1; ++i) {
            if (strchr(alphaBetical, i)) continue;
            table[i] = j++;
        }
        /* now run through the alphaBeticals */
        for (i = 0; alphaBetical[i]; ++i) {
            table[(int)alphaBetical[i]] = j++;
        }
        initialized = 1;
    }
    /* return the computed ordinal of the provided character */
    if (c < 0 || c > CHAR_MAX) return c;
    return table[c];
}

With this look-up table, we can now simplify the loop body of the alphaBetize() comparison function:

int alphaBetize (const char *a, const char *b) {
    int ax = alphaBeta_lookup(*a);
    int bx = alphaBeta_lookup(*b);
    int weight = 0;
    do {
        char al = tolower(*a);
        char bl = tolower(*b);
        if (ax != bx) {
            if (al != bl) {
                return alphaBeta_lookup(al) - alphaBeta_lookup(bl);
            }
            if (weight == 0) {
                weight = ax - bx;
            }
        }
        ax = alphaBeta_lookup(*++a);
        bx = alphaBeta_lookup(*++b);
    } while (ax && bx);
    /* if the words compared equal, use the weight as tie breaker */
    return (ax != bx) ? !bx - !ax : weight;
}

Can we make things simpler?

Using the collating table, you can create many different orderings with a simplified comparison function, like:

int simple_collating (const char *a, const char *b) {
    while (alphaBeta_lookup(*a) == alphaBeta_lookup(*b)) {
        if (*a == '\0') break;
        ++a, ++b;
    }
    return alphaBeta_lookup(*a) - alphaBeta_lookup(*b);
}

Using this same function and through modifying the alphaBetical string, you can achieve nearly any ordering you want (alphabetical, reverse alphabetical, vowels before consonants, etc.). However, the arrangement of keeping alike words together requires interspersing capitalized words with words in lowercase, and this can only be done by doing a comparison that ignores case.

Note that with the simple_collating() function above and the alphaBetical string I provided, Bacon will come before milk, but Mars will go after milk and before Milk.

If you want to sort based on your locale.

If you want to use a collating sequence that is already defined for your locale, you can set the locale and call the collating comparison function:

/*
 * To change the collating locale, use (for example):
       setlocale(LC_COLLATE, "en.US");
 */
int iso_collating (const char *a, const char *b) {
    return strcoll(a, b);
}

Now, by changing the locale, the sorting order will be based on a standardized collating sequence.

Leave a Comment