collect() or toPandas() on a large DataFrame in pyspark/EMR

TL;DR I believe you’re seriously underestimating memory requirements. Even assuming that data is fully cached, storage info will show only a fraction of peak memory required for bringing data back to the driver. First of all Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and compression algorithm in-memory size can … Read more

How to submit Spark jobs to EMR cluster from Airflow?

While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on (remote) EMR via Airflow Use Apache Livy This solution is actually independent of remote server, i.e., EMR Here’s an example The downside is that Livy is in early stages and its API appears incomplete and wonky … Read more