Using multiprocessing.Manager.list instead of a real list makes the calculation take ages

Linux uses copy-on-write when subprocesses are os.forked. To demonstrate:

import multiprocessing as mp
import numpy as np
import logging
import os

logger = mp.log_to_stderr(logging.WARNING)

def free_memory():
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u = unit))
                total += amount
    return total

def worker(i):
    x = data[i,:].sum()    # Exercise access to data
    logger.warn('Free memory: {m}'.format(m = free_memory()))

def main():
    procs = [mp.Process(target = worker, args = (i, )) for i in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

logger.warn('Initial free: {m}'.format(m = free_memory()))
N = 15000
data = np.ones((N,N))
logger.warn('After allocating data: {m}'.format(m = free_memory()))

if __name__ == '__main__':
    main()

which yielded

[WARNING/MainProcess] Initial free: 2522340
[WARNING/MainProcess] After allocating data: 763248
[WARNING/Process-1] Free memory: 760852
[WARNING/Process-2] Free memory: 757652
[WARNING/Process-3] Free memory: 757264
[WARNING/Process-4] Free memory: 756760

This shows that initially there was roughly 2.5GB of free memory.
After allocating a 15000×15000 array of float64s, there was 763248 KB free. This roughly makes sense since 15000**2*8 bytes = 1.8GB and the drop in memory, 2.5GB – 0.763248GB is also roughly 1.8GB.

Now after each process is spawned, the free memory is again reported to be ~750MB. There is no significant decrease in free memory, so I conclude the system must be using copy-on-write.

Conclusion: If you do not need to modify the data, defining it at the global level of the __main__ module is a convenient and (at least on Linux) memory-friendly way to share it among subprocesses.

More Related Contents:

Leave a Comment Cancel reply