Is there a way to efficiently yield every file in a directory containing millions of files?

tl;dr <update>: As of Python 3.5 (currently in beta) just use os.scandir
</update>

As I’ve written earlier, since “iglob” is just a facade for a real iterator, you will have to call low level system functions in order to get one at a time like you want. Fortunately, calling low level functions is doable from Python.
The low level functions are different for Windows and Posix/Linux systems.

  • If you are on Windows, you should check if win32api has any call to read “the next entry from a dir” or how to proceed otherwise.
  • If you are on Posix/Linux, you can proceed to call libc functions straight through ctypes and get a file-dir entry (including naming information) a time.

The documentation on the C functions is here:
http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory

http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory

I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system but this code snippet may not work on your system[footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.

Here is the snippet using ctypes and libc I’ve put together that allow you to get each filename, and perform actions on it. Note that ctypes automatically gives you a Python string when you do str(...) on the char array defined on the structure. (I am using the print statement, which implicitly calls Python’s str)

#!/usr/bin/env python2
from ctypes import *

libc = cdll.LoadLibrary( "libc.so.6")
dir_ = c_voidp( libc.opendir("/home/jsbueno"))

class Dirent(Structure):
    _fields_ = [("d_ino",  c_voidp),
                ("off_t", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)
            ]

while True:
    p  = libc.readdir64(dir_)
    if not p:
        break
    entry = Dirent.from_address( p)
    print entry.d_name

update: Python 3.5 is now in beta – and in Python 3.5 the new os.scandir function call is available as the materialization of PEP 471 (“a better and faster directory iterator”) which does exactly what is asked for here, besides a lot other optimizations that can deliver up to 9 fold speed increase over os.listdir on large-directories listing under Windows (2-3 fold increase in Posix systems).

[footnote-1] The dirent64 C struct is determined at C compile time for each system.

Leave a Comment