amazon-emr
Specify minimum number of generated files from Hive insert
The number of files generated during INSERT … SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) and on the configured bytes per reducer. If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case each reducer creates files in …
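To make the link between bytes per reducer and file count concrete, here is a minimal Python sketch of the classic Hive reducer estimate. It rests on several assumptions: the table is unpartitioned, each final reducer writes one file, and Tez auto reducer parallelism does not change the count at runtime. The function names are hypothetical, and the defaults (256 MB for `hive.exec.reducers.bytes.per.reducer`, 1009 for `hive.exec.reducers.max`) are the commonly documented ones, so check them against your Hive version.

```python
import math

# Commonly documented Hive defaults; treat them as assumptions for your cluster.
DEFAULT_BYTES_PER_REDUCER = 256 * 1024 * 1024
DEFAULT_MAX_REDUCERS = 1009


def estimate_reducers(input_bytes,
                      bytes_per_reducer=DEFAULT_BYTES_PER_REDUCER,
                      max_reducers=DEFAULT_MAX_REDUCERS):
    """Estimate reducers as min(max_reducers, ceil(input / bytes_per_reducer))."""
    if input_bytes <= 0:
        return 1
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))


def bytes_per_reducer_for(input_bytes, min_files):
    """Pick a bytes-per-reducer value that yields at least `min_files` reducers.

    Floor division rounds the setting down, so the resulting reducer
    count can only be equal to or larger than `min_files`.
    """
    return max(1, input_bytes // min_files)


if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3
    setting = bytes_per_reducer_for(ten_gib, min_files=100)
    print(f"SET hive.exec.reducers.bytes.per.reducer={setting};")
    print("estimated reducers:", estimate_reducers(ten_gib, setting))
```

Under these assumptions, lowering bytes per reducer raises the reducer count and therefore the minimum number of output files, up to the `hive.exec.reducers.max` cap.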
collect() or toPandas() on a large DataFrame in pyspark/EMR
TL;DR: I believe you’re seriously underestimating memory requirements. Even assuming that the data is fully cached, storage info will show only a fraction of the peak memory required to bring the data back to the driver. First of all, Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and compression algorithm, in-memory size can …
How to submit Spark jobs to EMR cluster from Airflow?
While it may not directly address your particular query, here are some broad ways you can trigger spark-submit on (remote) EMR via Airflow: Use Apache Livy. This solution is actually independent of the remote server, i.e., EMR. Here’s an example. The downside is that Livy is in its early stages and its API appears incomplete and wonky …
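As a rough illustration of the Livy route, the sketch below builds a batch submission against Livy's REST endpoint (`POST /batches`) using only the Python standard library. The host name, S3 path, and function names are hypothetical, port 8998 is only Livy's usual default, and error handling and authentication are left out. It is a minimal sketch, not the linked example from the original answer.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, args=None, conf=None):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {"file": app_file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request, timeout=30):
    """Send the request and return Livy's JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical EMR master host and job location.
    req = build_livy_batch_request(
        "http://emr-master.example.internal:8998",
        "s3://my-bucket/jobs/etl_job.py",
        args=["--date", "2020-01-01"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # submit(req) would perform the actual call; it needs a reachable Livy server.
```

In Airflow, a function like `submit` could be wrapped in a `PythonOperator` task. Keeping request construction separate from sending makes the payload easy to inspect without a running cluster.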