How to recursively list directories in C on Linux?

Why does everyone insist on reinventing the wheel again and again?

POSIX.1-2008 standardizes the nftw() function, which is also defined in the Single UNIX Specification v4 (SUSv4) and is available on Linux (glibc, man 3 nftw), OS X, and most current BSD variants. It is not new at all.

Naïve implementations based on opendir()/readdir()/closedir() almost never handle the cases where directories or files are moved, renamed, or deleted during the tree traversal, whereas nftw() should handle them gracefully.

As an example, consider the following C program. It lists the directory tree starting at the current working directory, or at each of the directories named on the command line, or just the files named on the command line:

/* We want POSIX.1-2008 + XSI, i.e. SUSv4, features */
#define _XOPEN_SOURCE 700

/* Added on 2017-06-25:
   If the C library can support 64-bit file sizes
   and offsets, using the standard names,
   these defines tell the C library to do so. */
#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64

#include <stdlib.h>
#include <unistd.h>
#include <ftw.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

/* POSIX.1 says each process has at least 20 file descriptors.
 * Three of those belong to the standard streams.
 * Here, we use a conservative estimate of 15 available;
 * assuming we use at most two for other uses in this program,
 * we should never run into any problems.
 * Most trees are shallower than that, so it is efficient.
 * Deeper trees are traversed fine, just a bit slower.
 * (Linux typically allows hundreds to thousands of open files,
 *  so you'll probably never see any issues even if you used
 *  a much higher value, say a couple of hundred, but
 *  15 is a safe, reasonable value.)
*/
#ifndef USE_FDS
#define USE_FDS 15
#endif

int print_entry(const char *filepath, const struct stat *info,
                const int typeflag, struct FTW *pathinfo)
{
    /* const char *const filename = filepath + pathinfo->base; */
    const double bytes = (double)info->st_size; /* Not exact if large! */
    struct tm mtime;

    localtime_r(&(info->st_mtime), &mtime);

    printf("%04d-%02d-%02d %02d:%02d:%02d",
           mtime.tm_year+1900, mtime.tm_mon+1, mtime.tm_mday,
           mtime.tm_hour, mtime.tm_min, mtime.tm_sec);

    if (bytes >= 1099511627776.0)
        printf(" %9.3f TiB", bytes / 1099511627776.0);
    else
    if (bytes >= 1073741824.0)
        printf(" %9.3f GiB", bytes / 1073741824.0);
    else
    if (bytes >= 1048576.0)
        printf(" %9.3f MiB", bytes / 1048576.0);
    else
    if (bytes >= 1024.0)
        printf(" %9.3f KiB", bytes / 1024.0);
    else
        printf(" %9.0f B  ", bytes);

    if (typeflag == FTW_SL) {
        char   *target;
        size_t  maxlen = 1023;
        ssize_t len;

        while (1) {

            target = malloc(maxlen + 1);
            if (target == NULL)
                return ENOMEM;

            len = readlink(filepath, target, maxlen);
            if (len == (ssize_t)-1) {
                const int saved_errno = errno;
                free(target);
                return saved_errno;
            }
            if (len >= (ssize_t)maxlen) {
                free(target);
                maxlen += 1024;
                continue;
            }

            target[len] = '\0';
            break;
        }

        printf(" %s -> %s\n", filepath, target);
        free(target);

    } else
    if (typeflag == FTW_SLN)
        printf(" %s (dangling symlink)\n", filepath);
    else
    if (typeflag == FTW_F)
        printf(" %s\n", filepath);
    else
    if (typeflag == FTW_D || typeflag == FTW_DP)
        printf(" %s/\n", filepath);
    else
    if (typeflag == FTW_DNR)
        printf(" %s/ (unreadable)\n", filepath);
    else
        printf(" %s (unknown)\n", filepath);

    return 0;
}


int print_directory_tree(const char *const dirpath)
{
    int result;

    /* Invalid directory path? */
    if (dirpath == NULL || *dirpath == '\0')
        return errno = EINVAL;

    result = nftw(dirpath, print_entry, USE_FDS, FTW_PHYS);
    if (result >= 0)
        errno = result;

    return errno;
}

int main(int argc, char *argv[])
{
    int arg;

    if (argc < 2) {

        if (print_directory_tree(".")) {
            fprintf(stderr, "%s.\n", strerror(errno));
            return EXIT_FAILURE;
        }

    } else {

        for (arg = 1; arg < argc; arg++) {
            if (print_directory_tree(argv[arg])) {
                fprintf(stderr, "%s.\n", strerror(errno));
                return EXIT_FAILURE;
            }
        }

    }

    return EXIT_SUCCESS;
}

Most of the code above is in print_entry(). Its task is to print out each directory entry. In print_directory_tree(), we tell nftw() to call it for each directory entry it sees.
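The callback also receives a struct FTW, which print_entry() above does not use (see its commented-out first line). As a minimal sketch of what that structure offers, the following variant indents each entry by its depth (pathinfo->level) and prints only the final path component (filepath + pathinfo->base). The names indent_entry(), print_indented_tree(), and max_level are made up for this illustration; they are not part of the program above.

```c
#define _XOPEN_SOURCE 700

#include <ftw.h>
#include <stdio.h>

/* nftw() has no user-data pointer, so state shared with the
 * callback has to live in a file-scope variable. */
static int max_level;

/* Hypothetical callback: print the entry name indented by depth. */
static int indent_entry(const char *filepath, const struct stat *info,
                        int typeflag, struct FTW *pathinfo)
{
    (void)info; /* unused in this sketch */

    if (pathinfo->level > max_level)
        max_level = pathinfo->level;

    /* pathinfo->base is the offset of the file name within filepath;
     * pathinfo->level is 0 for the starting path itself. */
    printf("%*s%s%s\n", 2 * pathinfo->level, "",
           filepath + pathinfo->base,
           (typeflag == FTW_D) ? "/" : "");

    return 0; /* zero tells nftw() to continue the walk */
}

/* Hypothetical wrapper: returns the deepest level seen,
 * or -1 if nftw() failed or the walk was stopped. */
int print_indented_tree(const char *dirpath)
{
    max_level = 0;
    if (nftw(dirpath, indent_entry, 15, FTW_PHYS) != 0)
        return -1;
    return max_level;
}
```

Calling print_indented_tree(".") from main() would print the current directory as an indented outline. The file-scope variable makes the sketch non-reentrant, which is a limitation of the nftw() interface rather than of this particular example.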

The only hand-wavy detail above is the decision on how many file descriptors one should let nftw() use. If your program uses at most two extra file descriptors (in addition to the standard streams) during the file tree walk, 15 is known to be safe on any mostly POSIX-compliant system that provides nftw().

On Linux, you could use sysconf(_SC_OPEN_MAX) to find the maximum number of open files and subtract the number you use concurrently with the nftw() call, but I wouldn’t bother (unless I knew the utility would be used mostly with pathologically deep directory structures). Fifteen descriptors do not limit the tree depth; nftw() just gets slower (and might not detect changes in a directory while walking more than 13 levels below it, although the tradeoffs and the general ability to detect changes vary between systems and C library implementations). Using a compile-time constant like that keeps the code portable: it should work not just on Linux, but on Mac OS X and all current BSD variants, and most other not-too-old Unix variants, too.
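For completeness, here is a rough sketch of the sysconf() approach, should you ever need it. The helper name nftw_fd_budget(), its reserved parameter, and the cap of 256 are assumptions made for this illustration, not recommendations from any standard.

```c
#define _XOPEN_SOURCE 700

#include <unistd.h>
#include <limits.h>

/* Hypothetical helper: how many descriptors to hand to nftw().
 * 'reserved' is the number of descriptors the program needs for
 * itself (standard streams plus anything it opens during the walk). */
int nftw_fd_budget(int reserved)
{
    long limit = sysconf(_SC_OPEN_MAX); /* -1 on error or if indeterminate */
    long budget;

    if (limit < 0)
        limit = _POSIX_OPEN_MAX;        /* the minimum POSIX guarantees */

    budget = limit - reserved;
    if (budget < 1)
        budget = 1;                     /* nftw() needs at least one */
    if (budget > 256)
        budget = 256;                   /* arbitrary cap for this sketch */

    return (int)budget;
}
```

The result would replace USE_FDS in the nftw() call, for example nftw(dirpath, print_entry, nftw_fd_budget(5), FTW_PHYS) if the program needs the three standard streams plus two descriptors of its own.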

In a comment, Ruslan mentioned that they had to switch to nftw64() because they had filesystem entries that required 64-bit sizes/offsets, and the “normal” version of nftw() failed with errno == EOVERFLOW. The correct solution is not to switch to glibc-specific 64-bit functions, but to define _LARGEFILE64_SOURCE and _FILE_OFFSET_BITS 64 before including any headers. These tell the C library to switch to 64-bit file sizes and offsets if possible, while using the standard functions (nftw(), fstat(), et cetera) and type names (off_t etc.).
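One way to check that the define took effect is to look at the width of off_t in the same translation unit. This is only a sketch; off_t_bits() is a hypothetical helper name. On 64-bit Linux, off_t is already 64 bits wide, so the define mainly matters on 32-bit targets.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64 /* must come before every #include */

#include <sys/types.h>
#include <limits.h>

/* Hypothetical helper: width of off_t, in bits, as seen by this file. */
int off_t_bits(void)
{
    return (int)(sizeof (off_t) * CHAR_BIT);
}
```

If this reports 32 on some platform, the C library there does not honor _FILE_OFFSET_BITS, and files of 2 GiB or larger may still trigger EOVERFLOW.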
