amazon-emr - w3toppers.com

Specify minimum number of generated files from Hive insert

The number of files generated during INSERT … SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) and on the configured bytes per reducer. If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case each reducer creates files in …
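To make the relationship above concrete, here is a rough Python sketch of the arithmetic. It assumes a simplified model in which the reducer count is the input size divided by the bytes-per-reducer setting, capped at a maximum, and in which the worst case is one file per reducer per partition. The excerpt is truncated before it says this, so treat the model, the function names, and the sample numbers as illustrative assumptions rather than Hive's exact behavior.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Simplified estimate: one reducer per `bytes_per_reducer` of input, capped."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


def worst_case_files(reducers, partitions):
    """Assumed worst case without DISTRIBUTE BY: every reducer writes to every partition."""
    return reducers * partitions


# Hypothetical job: 10 GiB of input, 256 MiB per reducer, 30 target partitions.
reducers = estimate_reducers(10 * 1024**3, 256 * 1024**2, max_reducers=1009)
files = worst_case_files(reducers, partitions=30)
print(reducers, files)
```

Under the same assumptions, distributing by the partition key sends all rows of a partition to a single reducer, so the file count from that stage is bounded by the number of partitions rather than by reducers times partitions.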

collect() or toPandas() on a large DataFrame in pyspark/EMR

TL;DR: I believe you’re seriously underestimating memory requirements. Even assuming that the data is fully cached, storage info will show only a fraction of the peak memory required to bring the data back to the driver. First of all, Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and compression algorithm, the in-memory size can …

How to submit Spark jobs to EMR cluster from Airflow?

While it may not directly address your particular query, here are some broad ways to trigger spark-submit on a (remote) EMR cluster via Airflow. Use Apache Livy: this solution is actually independent of the remote server, i.e., EMR. Here’s an example. The downside is that Livy is in its early stages and its API appears incomplete and wonky …
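As a minimal sketch of the Livy route, the snippet below builds (but does not send) the HTTP request that submits a batch job through Livy's `POST /batches` endpoint. The host name, port, S3 path, and the helper name `build_livy_batch_request` are hypothetical placeholders, and wiring this into an Airflow task (for example from a Python callable) is left as an assumption rather than shown.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, args=None, conf=None):
    """Build a POST /batches request for Livy; the caller decides when to send it."""
    payload = {"file": app_file}  # path to the Spark application to run
    if args:
        payload["args"] = list(args)  # command-line arguments for the application
    if conf:
        payload["conf"] = dict(conf)  # Spark configuration properties
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and job location, for illustration only.
request = build_livy_batch_request(
    "http://emr-master.example.com:8998",
    "s3://example-bucket/jobs/etl.py",
    args=["--date", "2020-01-01"],
)
print(request.get_method(), request.full_url)
# To actually submit, pass `request` to urllib.request.urlopen(...).
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test before anything touches the cluster.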

AWS EMR – ModuleNotFoundError: No module named ‘pyarrow’

Dealing with a large gzipped file in Spark

Saving dataframe to local file system results in empty results

Specify minimum number of generated files from Hive insert
