Checking for corrupted files in a directory with hundreds of thousands of images gradually slows down

If you have that many images, I would suggest you use multiprocessing. I created 100,000 files, of which 5% were corrupt, and checked them with the script shown below.
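In case you want to reproduce a similar test set, here is a minimal sketch of one way to generate it. This is only an illustration, not necessarily how the original test files were made; the filenames, image size and corruption method are arbitrary choices:

import random
from PIL import Image

N = 100_000  # use a smaller number for a quick trial
for i in range(N):
    name = f"image_{i:06d}.jpg"
    # Write a small valid JPEG...
    Image.new("RGB", (64, 64), (255, 0, 0)).save(name)
    # ...then overwrite roughly 5% of the files with junk bytes so PIL cannot open them
    if random.random() < 0.05:
        with open(name, "wb") as fh:
            fh.write(b"\x00" * 128)

The checking script itself looks like this: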

#!/usr/bin/env python3

import glob
from multiprocessing import Pool
from PIL import Image

def CheckOne(f):
    """Return the filename if the image appears corrupt, otherwise None."""
    try:
        im = Image.open(f)
        im.verify()
        im.close()
        # DEBUG: print(f"OK: {f}")
        return None
    except (IOError, OSError, Image.DecompressionBombError):
        # DEBUG: print(f"Fail: {f}")
        return f

if __name__ == '__main__':
    # Create a list of files to process
    files = glob.glob("*.jpg")

    print(f"Files to be checked: {len(files)}")

    # Create a pool of processes and map the list of files onto it
    with Pool() as p:
        result = p.map(CheckOne, files)

    # Filter out None values representing files that are OK, leaving just the corrupt ones
    corrupt = list(filter(None, result))
    print(f"Num corrupt files: {len(corrupt)}")

Sample Output

Files to be checked: 100002
Num corrupt files: 5001

That takes 1.6 seconds on my 12-core CPU with an NVMe SSD; your hardware may be slower, but it should still be noticeably faster than checking the files one at a time.
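If the directory really holds hundreds of thousands of files, you can also stream the results instead of collecting them all in one list at the end. A small variation, reusing CheckOne and the imports from the script above (the chunksize of 256 is just a starting point to tune for your data; imap_unordered hands filenames to the workers in batches and yields results as soon as each one finishes):

if __name__ == '__main__':
    files = glob.glob("*.jpg")
    corrupt = []
    with Pool() as p:
        # Results arrive in completion order, so you can report progress as you go
        for r in p.imap_unordered(CheckOne, files, chunksize=256):
            if r:
                corrupt.append(r)
    print(f"Num corrupt files: {len(corrupt)}")

Pool.map already splits the work into chunks internally, so this mainly helps if you want progress output while the check runs rather than a single result list at the end.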
