collect() or toPandas() on a large DataFrame in pyspark/EMR

TL;DR I believe you’re seriously underestimating memory requirements.

Even assuming the data is fully cached, the storage info shows only a fraction of the peak memory required to bring the data back to the driver: collecting needs room for the serialized results, the deserialized JVM objects, and (for toPandas()) the pandas representation, all at once.
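
One quick way to sanity-check this is to convert a small sample and extrapolate. This is just a rough sketch; the `df` name, the 1% fraction, and the seed are placeholders:

```python
# Estimate the pandas-side footprint from a 1% sample (assumed fraction).
sample_pdf = df.sample(fraction=0.01, seed=42).toPandas()
sample_bytes = sample_pdf.memory_usage(deep=True).sum()

# Extrapolate to the full dataset; peak usage during collection is higher still.
estimated_gib = sample_bytes / 0.01 / 1024**3
print(f"Estimated pandas footprint: {estimated_gib:.1f} GiB")
```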

Since the data is actually quite large, I would consider writing it to Parquet and reading it back directly in Python using PyArrow (see Reading and Writing the Apache Parquet Format), skipping all the intermediate stages entirely.
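
Something like the following, as a minimal sketch; the path and DataFrame name are placeholders, and on EMR you would typically point at an S3 or HDFS location instead:

```python
import pyarrow.parquet as pq

# Write out from Spark: executors write the files in parallel,
# so nothing has to pass through the driver.
df.write.mode("overwrite").parquet("/tmp/my_data")  # assumed path

# Read the resulting directory back directly with PyArrow,
# bypassing the Spark driver entirely.
table = pq.read_table("/tmp/my_data")
pdf = table.to_pandas()
```

PyArrow reads the whole directory as one dataset and ignores marker files like `_SUCCESS`, so the Spark output layout works as-is.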
