Spark: Inconsistent performance numbers when scaling the number of cores

Theoretical limitations

I assume you are familiar with Amdahl’s law, but here is a quick reminder. Theoretical speedup is defined as follows:

    S(s) = 1 / ((1 - p) + p / s)

where :

  • s – is the speedup of the parallel part.
  • p – is the fraction of the program that can be parallelized.

In practice, theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95), the theoretical limit is quite low:

(Plot: theoretical speedup under Amdahl’s law as a function of the number of processors, for several values of p. Image by Daniels220 at English Wikipedia, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.)
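
To put a number on it, here is a quick sketch (plain Scala, just evaluating the formula above; the chosen p and core counts are illustrative) showing how quickly the returns diminish:

    // Amdahl's law: speedup of a job where a fraction p can be parallelized,
    // evaluated for an increasing number of cores s.
    object AmdahlSketch extends App {
      def speedup(p: Double, s: Int): Double = 1.0 / ((1.0 - p) + p / s)

      val p = 0.95  // optimistic fraction of parallelizable work
      for (cores <- Seq(1, 2, 4, 8, 16, 64, 1024)) {
        println(f"cores=$cores%5d speedup=${speedup(p, cores)}%.2f")
      }
      // As cores -> infinity, the speedup approaches 1 / (1 - p) = 20x for p = 0.95.
    }

Even with 1024 cores the speedup stays below 20x, the hard limit of 1 / (1 - p) for p = 0.95.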

Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn’t dream about anything close to 0.95 or higher. This is because

Spark is a high cost abstraction

Spark is designed to work on commodity hardware at datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes
and execute long-running jobs, but it doesn’t scale down very well.

Spark is not focused on parallel computing

In practice Spark and similar systems are focused on two problems:

  • Reducing overall IO latency by distributing IO operations between multiple nodes.
  • Increasing the amount of available memory without increasing the cost per unit.

which are fundamental problems for large scale, data intensive systems.

Parallel processing is more a side effect of this particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with an increasing amount of data by scaling out, not to speed up existing computations.

With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than a typical Spark cluster, but it doesn’t necessarily help in data-intensive jobs due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.

Practical implications

  • Spark is not a replacement for multiprocessing or multithreading on a single machine.
  • Increasing parallelism on a single machine is unlikely to bring any improvements and will typically decrease performance due to the overhead of its components (see the sketch below).
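
A hypothetical way to see this on one machine (the job, data size, and thread counts below are illustrative, not taken from the question) is to run the same trivial job with different local[n] settings and compare wall-clock times:

    import org.apache.spark.sql.SparkSession

    // Hypothetical micro-benchmark: run the same job with a different number of
    // local threads. On a single machine the scaling flattens out well below
    // the core count because of scheduling and serialization overhead.
    object LocalScalingSketch extends App {
      def timeMs[T](body: => T): Long = {
        val start = System.nanoTime()
        body
        (System.nanoTime() - start) / 1000000
      }

      for (threads <- Seq(1, 2, 4, 8)) {
        val spark = SparkSession.builder()
          .appName(s"local-scaling-$threads")
          .master(s"local[$threads]")
          .getOrCreate()
        val sc = spark.sparkContext

        val elapsed = timeMs {
          sc.parallelize(1 to 10000000, numSlices = threads)
            .map(x => math.sqrt(x.toDouble))
            .sum()
        }
        println(s"threads=$threads elapsed=${elapsed}ms")

        spark.stop()
      }
    }

How much (if at all) the numbers improve past a couple of threads depends heavily on the job; the point is only that core count alone is not the knob that matters here.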

In this context:

Assuming that the class and jar are meaningful and it is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to execute the whole Spark sorting machinery with shuffle files and data exchange.
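
To make that concrete, here is a minimal sketch of the two approaches being contrasted; the input path, the use of sortBy, and sorting plain text lines are assumptions for illustration, not the asker’s actual job:

    import org.apache.spark.sql.SparkSession

    // Hypothetical comparison: "data.txt" and sorting raw lines are placeholders.
    object SortComparisonSketch extends App {
      val spark = SparkSession.builder()
        .appName("sort-comparison")
        .master("local[*]")
        .getOrCreate()
      val sc = spark.sparkContext

      // Full Spark sort: range partitioning, shuffle files and data exchange
      // between tasks. The machinery only pays off on genuinely large inputs.
      val distributedSorted = sc.textFile("data.txt")
        .sortBy(identity)
        .collect()

      // Single-partition alternative: read once and sort in memory on the driver.
      // For data that fits comfortably in one JVM this is usually cheaper.
      val locallySorted = sc.textFile("data.txt", minPartitions = 1)
        .collect()
        .sorted

      spark.stop()
    }

Both end up with the same sorted result; the difference is that the first one pays for a shuffle that a single machine simply doesn’t need.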
