os.walk very slow, any way to optimise?

Yes: use Python 3.5 (which is currently still a release candidate, but should be out shortly). In Python 3.5, os.walk was rewritten to be more efficient.

This work was done as part of PEP 471.

Extracted from the PEP:

Python’s built-in os.walk() is significantly slower than it needs to
be, because — in addition to calling os.listdir() on each directory
— it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls — FindFirstFile / FindNextFile on
Windows and readdir on POSIX systems — already tell you whether the
files returned are directories or not, so no further system calls are
needed. Further, the Windows system calls return all the information
for a stat_result object on the directory entry, such as file size and
last modification time.

In short, you can reduce the number of system calls required for a
tree function like os.walk() from approximately 2N to N, where N is
the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it’s often much
better than this.)

In practice, removing all those extra system calls makes os.walk()
about 8-9 times as fast on Windows, and about 2-3 times as fast on
POSIX systems. So we’re not talking about micro-optimizations. See
more benchmarks here.
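
If you want to see the idea in code, here is a rough sketch of a tree walk built directly on os.scandir(), the function PEP 471 added in Python 3.5. The helper name iter_files is made up for this example, and this is not how os.walk itself is written, just an illustration of using the entry type that the directory listing already provides instead of calling stat() on every file.

```python
import os

def iter_files(top):
    """Yield the path of every non-directory entry under top.

    Hypothetical helper for illustration only; not part of the standard library.
    """
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            # Read the whole listing up front so the directory handle is released.
            entries = list(os.scandir(current))
        except OSError:
            # Unreadable directory: skip it, much like os.walk does by default.
            continue
        for entry in entries:
            # is_dir() can usually answer from the data returned by the
            # directory listing, so no separate stat() call is needed.
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            else:
                yield entry.path
```

Usage would look something like `for path in iter_files("."): print(path)`. On Python 3.5 and later, plain os.walk already gets this benefit, so a hand-rolled loop like this mostly makes sense when you also want other data from each entry, such as entry.stat() for file sizes.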
