Don’t sort 10 million lines in memory. Split the work into batches instead:
- Run 100 sorts of 100k lines each (using the file as an iterator, combined with `islice()` or similar to pick a batch), and write each sorted batch out to a separate file.
- Merge the sorted files. Here is a merge generator that you can pass 100 open files; it'll yield lines in sorted order. Write them to a new file line by line:
```python
import operator


def mergeiter(*iterables, **kwargs):
    """Given a set of sorted iterables, yield the next value in merged order.

    Takes an optional `key` callable to compare values by.
    """
    iterables = [iter(it) for it in iterables]
    # Map each iterable's index to [current value, index, iterator].
    iterables = {i: [next(it), i, it] for i, it in enumerate(iterables)}
    if 'key' not in kwargs:
        key = operator.itemgetter(0)
    else:
        key = lambda item, key=kwargs['key']: key(item[0])

    while True:
        # Pick the smallest current value across all the iterables.
        value, i, it = min(iterables.values(), key=key)
        yield value
        try:
            iterables[i][0] = next(it)
        except StopIteration:
            # This iterable is exhausted; drop it from the pool.
            del iterables[i]
            if not iterables:
                # All inputs exhausted; end the generator. (Re-raising
                # StopIteration here would become a RuntimeError under
                # PEP 479 on Python 3.7+.)
                return
```
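To make the two steps concrete, here is a minimal end-to-end sketch. It uses a small in-memory list standing in for the 10-million-line file and a chunk size of 3 standing in for 100k; the file handling via `tempfile` is one possible choice, and the merge step uses the stdlib `heapq.merge()`, which does the same job as the `mergeiter()` generator above:

```python
import heapq
import tempfile
from itertools import islice

# Stand-in for the big input file: a handful of unsorted lines.
lines = ["line%03d\n" % n for n in [5, 3, 9, 1, 8, 2, 7, 4, 6, 0]]

chunk_size = 3  # would be 100_000 in the real scenario
source = iter(lines)
chunk_files = []

# Step 1: sort fixed-size chunks and write each to its own temp file.
while True:
    chunk = sorted(islice(source, chunk_size))
    if not chunk:
        break
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(chunk)
    f.seek(0)  # rewind so the merge step can read it back
    chunk_files.append(f)

# Step 2: merge the sorted chunk files; iterating an open text file
# yields its lines, so the merge streams line by line.
merged = list(heapq.merge(*chunk_files))
print(merged == sorted(lines))  # True
```

In the real version you would write `merged` to an output file line by line instead of collecting it in a list, so nothing ever holds all 10 million lines at once.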