Spark: long delay between jobs

I/O operations often come with significant overhead that occurs on the master node (more precisely, in the Spark driver process). Because this work is not parallelized, it can take a long time, and because it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:

  • Spark writes to temporary S3 directories, then moves the files into place using the master node
  • Reading text files often occurs on the master node
  • When writing Parquet files, the master node scans all the files after the write to check the schema

These issues can be solved by tweaking YARN or Spark settings, or by redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
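As a rough illustration of the "tweaking settings" route, here is a minimal sketch of options that are commonly suggested for reducing this kind of post-write work. These particular keys are my assumption about what is relevant, not something taken from your setup. Whether each one helps, or is even honoured, depends on your Spark, Hadoop and Parquet versions. The names DRIVER_IO_CONF and to_submit_args are hypothetical helpers made up for this example.

```python
# Sketch only: candidate settings for reducing single-node I/O work after a write.
# Check each key against the documentation for your Spark/Hadoop/Parquet versions.
DRIVER_IO_CONF = {
    # Commit algorithm v2 moves task output into the destination at task commit,
    # leaving less renaming to do at job commit. It trades away some
    # failure-safety, so treat it as an experiment.
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    # Skip writing Parquet summary files, which requires reading the footers
    # of the written files after the job finishes.
    "spark.hadoop.parquet.enable.summary-metadata": "false",
    # Do not merge schemas across Parquet part-files when reading.
    "spark.sql.parquet.mergeSchema": "false",
}


def to_submit_args(conf):
    """Render a dict of settings as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(to_submit_args(DRIVER_IO_CONF)))
```

The same keys can be set on a SparkConf or through SparkSession.builder.config(...) instead of on the command line. If the gap between jobs does not shrink after trying them, the cause is more likely in the code itself.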

Discussion of writing I/O Overhead with Parquet and s3

Discussion of reading I/O Overhead “s3 is not a filesystem”
