How do we choose --nthreads and --nprocs per worker in Dask distributed?

It depends on your workload. By default, Dask creates a single process with as many threads as you have logical cores on your machine (as determined by multiprocessing.cpu_count()).

    dask-worker … --nprocs 1 --nthreads 8   # assuming you have eight cores
    dask-worker …                           # this is actually the default setting

Using few processes and many threads … Read more
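As a hedged aside, the same split can be chosen programmatically; the sketch below assumes an eight-core machine and mirrors the CLI flags above using dask.distributed's LocalCluster.

    # A minimal sketch, assuming an eight-core machine; mirrors the flags above.
    from dask.distributed import Client, LocalCluster

    # One process, eight threads: suits numeric code that releases the GIL.
    # For GIL-bound pure-Python work, invert it: n_workers=8, threads_per_worker=1.
    cluster = LocalCluster(n_workers=1, threads_per_worker=8)
    client = Client(cluster)
    print(client)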

What is spark.driver.maxResultSize?

Assuming that a worker wants to send 4G of data to the driver, would setting spark.driver.maxResultSize=1G cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize)? No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from … Read more
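To make the setting concrete, here is a minimal, hedged sketch of raising the limit in PySpark; the application name and the "4g" value are illustrative assumptions, not values from the original answer.

    # A minimal sketch: raising spark.driver.maxResultSize so a large
    # driver-side collect() is not aborted. "4g" is a placeholder value.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("max-result-size-demo")              # hypothetical app name
        .config("spark.driver.maxResultSize", "4g")   # "0" would mean unlimited
        .getOrCreate()
    )

    # Driver-bound actions (collect, toPandas, ...) are what this limit guards.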

Why isn’t RDBMS Partition Tolerant in CAP Theorem and why is it Available?

It is very easy to misunderstand the CAP properties, hence I’m providing some illustrations to make it easier. Consistency: a query Q will produce the same answer A regardless of the node that handles the request. In order to guarantee full consistency we need to ensure that all nodes agree on the same value at all … Read more
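As a purely illustrative toy (no real database API is implied), the sketch below encodes that definition: a read is consistent only if every node returns the same answer.

    # Toy model of the consistency property -- illustrative only, not a real DB.
    replicas = {"node1": {"x": 1}, "node2": {"x": 1}, "node3": {"x": 1}}

    def consistent_read(key):
        # Full consistency: every node must agree on the value before we answer.
        values = {node: store.get(key) for node, store in replicas.items()}
        assert len(set(values.values())) == 1, f"nodes disagree: {values}"
        return next(iter(values.values()))

    print(consistent_read("x"))  # 1, no matter which node handled the request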

Easiest way to install Python dependencies on Spark executor nodes?

Having actually tried it, I think the link I posted as a comment doesn’t do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn’t supported better in Spark. … Read more
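One widely used workaround, offered here as a hedged sketch rather than what the linked answer does, is to pre-build the dependencies into an archive and ship it with addPyFile; deps.zip and some_package are hypothetical names.

    # A minimal sketch of shipping pure-Python dependencies to executors.
    # Assumes a pre-built archive, e.g.:
    #   pip install -t deps/ some_package && (cd deps && zip -r ../deps.zip .)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deps-demo").getOrCreate()
    spark.sparkContext.addPyFile("deps.zip")   # distributed to every executor

    def uses_dependency(x):
        import some_package   # hypothetical package; resolves from deps.zip
        return x

    spark.sparkContext.parallelize([1, 2, 3]).map(uses_dependency).collect()

Note that this approach only handles pure-Python packages; compiled extensions generally have to be installed on the worker nodes themselves.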

How does the Spark aggregate function aggregateByKey work?

aggregateByKey() is quite different from reduceByKey(); in fact, reduceByKey() is a particular case of aggregateByKey(). aggregateByKey() will combine the values for a particular key, and the result of such a combination can be any object that you specify. You have to specify how the values are combined (“added”) inside one partition (that … Read more
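A minimal sketch of that split, assuming PySpark: computing a per-key average, where the accumulator (a (sum, count) tuple) is deliberately a different type from the integer values, which is exactly what aggregateByKey allows and reduceByKey does not.

    # A minimal sketch: per-key average with aggregateByKey.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([("a", 1), ("a", 3), ("b", 10)])

    zero = (0, 0)                                   # (running sum, running count)
    seq  = lambda acc, v: (acc[0] + v, acc[1] + 1)  # fold a value into one partition's acc
    comb = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge accs across partitions

    sums = rdd.aggregateByKey(zero, seq, comb)
    print(sums.mapValues(lambda p: p[0] / p[1]).collect())  # [('a', 2.0), ('b', 10.0)]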

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (executing in a core) that runs on that worker. So … Read more
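As back-of-the-envelope arithmetic only (all numbers below are illustrative assumptions, and the right count is ultimately workload- and data-size-dependent), a common starting heuristic is 2-4 partitions per available core:

    # Back-of-the-envelope only: 2-4 partitions per core is a common heuristic.
    num_worker_nodes     = 4    # machines in the cluster (assumed)
    executors_per_worker = 2    # executor processes per machine (assumed)
    cores_per_executor   = 4    # concurrent tasks per executor (assumed)
    partitions_per_core  = 3    # typical 2-4x over-subscription

    total_cores = num_worker_nodes * executors_per_worker * cores_per_executor
    suggested_partitions = total_cores * partitions_per_core
    print(total_cores, suggested_partitions)  # 32 cores -> 96 partitions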