Why do these goroutines not scale their performance from more concurrent executions?

Fact #0: Premature optimisation efforts often have negative yields, showing they were just a waste of time & effort


Why?
A single “wrong” SLOC may devastate performance by more than about +37%, or may improve it so that it spends about 57% less than the baseline processing time:

51.151µs on MA(200) [10000]float64 ( the baseline )    ~ 22.017µs on MA(200) [10000]int ( ~ -57% )
70.325µs on MA(200) [10000]float64 ( one “wrong” SLOC later, ~ +37% )

Why []int-s?
You can see it above for yourself – this is the bread and butter of efficient sub-[us] HPC / fintech processing strategies ( and we are still speaking in terms of just [SERIAL] process scheduling ).

This may be tested on any scale – but rather first test your own implementation ( here ), on the very same scale – the MA(200) [10000]float64 setup – and post your baseline durations in [us], so as to see the initial process performance and to compare apples to apples against the posted 51.2 [us] threshold.
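
For a first apples-to-apples check, a minimal [SERIAL] sketch may look like the one below – assuming a plain windowed-sum formulation and one-shot time.Now() timing; the names ( movingAverage, the filler data ) are illustrative only, not the code behind the figures above:

    package main

    import (
        "fmt"
        "time"
    )

    // movingAverage computes a simple MA(w) over data, one output value per
    // fully-filled window -- a plain O(len(data)*w) reference implementation.
    func movingAverage(data []float64, w int) []float64 {
        out := make([]float64, 0, len(data)-w+1)
        for i := w; i <= len(data); i++ {
            sum := 0.0
            for _, v := range data[i-w : i] {
                sum += v
            }
            out = append(out, sum/float64(w))
        }
        return out
    }

    func main() {
        const n, w = 10000, 200
        data := make([]float64, n)
        for i := range data {
            data[i] = float64(i % 97) // any deterministic filler will do
        }

        t0 := time.Now()
        ma := movingAverage(data, w)
        dt := time.Since(t0)

        fmt.Printf("MA(%d) over [%d]float64 -> %d values in %v\n", w, n, len(ma), dt)
    }

( For stable figures at this ~ [us] scale one would rather wrap the call into a testing.B benchmark and run it via go test -bench, so as to average over many iterations. )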

Next comes the harder part:


Fact #1: This task is NOT embarrassingly parallel

Yes, one may go and implement a Moving Average calculation so that it indeed proceeds through the heaps of data using some intentionally indoctrinated “just”-[CONCURRENT] processing approach ( whether due to some kind of error, some authority’s “advice”, professional blindness or plain dual-Socrates-fair ignorance ). That obviously does not mean that the convolutional stream-processing nature, present inside the Moving Average mathematical formulation, has forgotten to be a pure [SERIAL] process, just because of an attempt to enforce its calculation inside some degree of “just”-[CONCURRENT] processing.
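
To see where the [SERIAL] nature bites, consider the cheap O(N) streaming form of the very same Moving Average – a sketch only, which could replace movingAverage in the baseline above – where each step re-uses the previous window’s running sum, i.e. a loop-carried dependency that no “just”-[CONCURRENT] split of the loop removes:

    // rollingMA is the O(N) streaming form: every iteration re-uses the sum
    // carried over from the previous one, so step i cannot start before
    // step i-1 has finished -- the loop-carried dependency discussed above.
    func rollingMA(data []float64, w int) []float64 {
        out := make([]float64, 0, len(data)-w+1)
        sum := 0.0
        for i, v := range data {
            sum += v
            if i >= w {
                sum -= data[i-w] // drop the value leaving the window
            }
            if i >= w-1 {
                out = append(out, sum/float64(w))
            }
        }
        return out
    }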

( Btw. the Hard Computer-Scientists and dual-domain nerds will also object here that the Go language, by Rob Pike’s design, provides a framework of concurrent goroutines, not any true-[PARALLEL] process-scheduling, even though Hoare’s CSP tools, available in the language concept, may add some salt and pepper and introduce a stop-block type of inter-process communication, which will lock “just”-[CONCURRENT] code sections into some hard-wired CSP-p2p-synchronisation. )
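
( For illustration only: this is the kind of CSP-p2p blocking rendezvous meant above – an unbuffered Go channel makes the sending goroutine wait until a receiver is ready, hard-wiring a synchronisation point into otherwise “just”-[CONCURRENT] code: )

    package main

    import "fmt"

    func main() {
        done := make(chan float64) // unbuffered: a send blocks until a receive is ready

        go func() {
            partial := 42.0 // stand-in for some partial result
            done <- partial // the goroutine blocks here until main reads
        }()

        result := <-done // CSP-style rendezvous: main blocks here until the send arrives
        fmt.Println("received:", result)
    }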


Fact #2: Go distributed ( for any kind of speedup ) only AT THE END

Having a poor level of [SERIAL] performance sets no yardstick. Only after a reasonable amount of single-thread performance tuning may one benefit from going distributed ( still having to pay additional serial costs, which brings Amdahl’s Law ( rather the overhead-strict Amdahl’s Law ) into the game ).

If one can introduce such a low level of additional setup overheads and still achieve remarkable parallelism, scaled into the non-[SEQ] part of the processing, there and only there comes a chance to increase the effective process performance.

It is not hard to lose much more than to gain here, so always benchmark the pure-[SEQ] baseline against the theoretical, overhead-naive non-[SEQ] / N[PAR]_processes speedup, for which one will pay a cost of the sum of all the add-on-[SEQ] overheads. Going parallel makes sense if and only if:

(       pure-[SEQ]_processing        [ns]
  +   add-on-[SEQ]-setup-overheads   [ns]
  +  (  non-[SEQ]_processing         [ns] / N[PAR]_processes )
)
<<
(       pure-[SEQ]_processing        [ns]
  +  (  non-[SEQ]_processing         [ns] / 1 )
)
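
A back-of-the-envelope helper for that decision rule – all the [ns] figures below are placeholders, to be replaced by one’s own measured durations:

    package main

    import "fmt"

    func main() {
        // all [ns] figures are placeholders -- substitute your own measurements
        const (
            pureSEQ   = 5_000.0  // pure-[SEQ]_processing        [ns]
            setupSEQ  = 20_000.0 // add-on-[SEQ]-setup-overheads [ns]
            nonSEQ    = 45_000.0 // non-[SEQ]_processing         [ns]
            nPARprocs = 8.0      // N[PAR]_processes
        )

        parallelised := pureSEQ + setupSEQ + nonSEQ/nPARprocs
        serialOnly := pureSEQ + nonSEQ/1

        fmt.Printf("parallelised : %9.1f [ns]\n", parallelised)
        fmt.Printf("serial-only  : %9.1f [ns]\n", serialOnly)
        fmt.Println("going parallel pays off only if parallelised << serial-only")
    }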

Not having this jet-fighter’s advantage of both the surplus height and the Sun behind you, never attempt any kind of HPC / parallelisation effort – it will never pay for itself unless it is remarkably << better than a smart [SEQ]-process.



Epilogue: on overhead-strict Amdahl’s Law interactive experiment UI

One animation is worth a million words.

An interactive animation is even better:

So,
assume a process-under-test, which has both a [SERIAL] and a [PARALLEL] part of the process schedule.

Let p be the [PARALLEL] fraction of the process duration, p ~ ( 0.0 .. 1.0 ); thus the [SERIAL] part does not last longer than ( 1 - p ), right?

So, let’s start the interactive experimentation from a test-case where p == 1.0, meaning the whole process duration is spent in just the [PARALLEL] part, and both the initial and the terminating parts of the process-flow ( which in principle are always [SERIAL] ) have zero durations ( ( 1 - p ) == 0. )

Assume the system does no particular magic and thus needs to spend some real steps on the initialisation of each [PARALLEL] part, so as to run it on a different processor ( (1), 2, .., N ). So let’s add some overheads for re-organising the process flow and for marshalling + distributing + un-marshalling all the necessary instructions and data, so that the intended process can now start and run on N processors in parallel.

These costs are called o ( here initially assumed, for simplicity, to be just constant and invariant to N, which is not always the case in reality, on silicon / on NUMA / on distributed infrastructures ).
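
Under exactly these simplifying assumptions ( a constant o, expressed as a fraction of the N == 1 run-time ), the quantity such an interactive tool plots is the overhead-strict Amdahl speedup – a sketch of it:

    package main

    import "fmt"

    // speedup returns the overhead-strict Amdahl speedup for a [PARALLEL]
    // fraction p and a constant setup-overhead o ( both expressed as
    // fractions of the N == 1 run-time ), when run on n processors.
    func speedup(p, o float64, n int) float64 {
        return 1.0 / ((1.0 - p) + o + p/float64(n))
    }

    func main() {
        const o = 0.0001 // ~ 0.01 % setup overhead
        for _, p := range []float64{1.00, 0.99, 0.95} {
            for _, n := range []int{1, 2, 4, 8, 16, 32} {
                fmt.Printf("p = %.2f  o = %.4f  N = %2d  S = %8.2f\n",
                    p, o, n, speedup(p, o, n))
            }
        }
    }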

Clicking on the Epilogue headline above opens an interactive environment, free for one’s own experimentation.

With p == 1. && o == 0. && N > 1 the performance grows steeply, up to the currently achievable [PARALLEL]-hardware O/S limits for a still monolithic O/S code-execution ( still without any additional distribution costs for MPI- and similar depeche-mode distributions of work-units, where one would immediately have to add indeed a big number of [ms], while our so far best just-[SERIAL] implementation has obviously done the whole job in less than ~ 22.1 [us] ).

But except for such an artificially optimistic case, the job does not look so cheap to get efficiently parallelised.

  • Try having not a zero, but just about ~ 0.01% of setup-overhead costs in o, and the line starts to show a very different nature of the overhead-aware scaling, even for the utmost extreme [PARALLEL] case ( still having p == 1.0 ), with the potential speedup dropping to somewhere near half of the initially super-idealistic linear-speedup case.

  • Now, turn p to something closer to reality, somewhere less artificially set than the initial super-idealistic case of == 1.00 --> { 0.99, 0.98, 0.95 } and … bingo, this is the reality where process-scheduling ought to be tested and pre-validated.

What does that mean?

As an example, if the overhead ( of launching + finally joining a pool of coroutines ) takes more than ~ 0.1% of the actual [PARALLEL] processing-section duration, there would be no speedup bigger than ~ 4x ( about 1/4 of the original duration ) for 5 coroutines ( having p ~ 0.95 ), and no more than ~ 10x ( a 10-times shorter duration ) for 20 coroutines – all of that assuming the system has 5 CPU-cores, resp. 20 CPU-cores, free & available and ready ( best with O/S-level CPU-core-affinity mapped processes / threads ) for uninterrupted serving of all those coroutines during their whole life-span, so as to achieve the above-expected speedups.
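
( A worked check, assuming the same constant-o speedup form sketched above, S = 1 / ( ( 1 - p ) + o + p / N ), with p ~ 0.95 and o ~ 0.001: )

    S( N =  5 ) = 1 / ( 0.05 + 0.001 + 0.95 /  5 ) = 1 / 0.2410 ~  4.1 x
    S( N = 20 ) = 1 / ( 0.05 + 0.001 + 0.95 / 20 ) = 1 / 0.0985 ~ 10.2 x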

Not having such an amount of hardware resources free and ready for all of those task-units intended to implement the [PARALLEL]-part of the process-schedule, the blocking/waiting states will introduce additional absolute wait-states. The resulting performance adds these new-[SERIAL]-blocking/waiting sections to the overall process-duration, the initially wished-for speedups suddenly cease to exist, and the performance factor falls well under << 1.00 ( meaning that the effective run-time was, due to the blocking-states, way slower than the non-parallelised just-[SERIAL] workflow ).

This may sound complicated for new, keen experimenters; however, we may put it in a reversed perspective. Given that the whole process of distributing the intended [PARALLEL] pool-of-tasks is known not to be shorter than, say, about 10 [us], the overhead-strict graphs show there needs to be at least about 1000 x 10 [us] of non-blocking, computing-intensive processing inside the [PARALLEL] section, so as not to devastate the efficiency of the parallelised processing.

If there is not a sufficiently “fat” piece of processing, the overhead costs ( going remarkably above the above-cited threshold of ~ 0.1% ) brutally devastate the net efficiency of the otherwise successfully parallelised processing ( having been performed at unjustifiably high relative setup costs versus the limited net effects of any N processors, as demonstrated in the available live-graphs ).

It is no surprise for distributed-computing nerds that the overhead o also comes with additional dependencies – on N ( the more processes, the more effort has to be spent to distribute work-packages ), on the marshalled data-BLOBs’ sizes ( the larger the BLOBs, the longer the MEM-/IO-devices remain blocked before serving the next process with a distributed BLOB across such a device/resource, for each of the target 2..N-th receiving processes ), and on avoided / CSP-signalled, channel-mediated inter-process coordinations ( call it additional per-incident blocking, reducing p further and further below the ultimately nice ideal of 1. ).

So, the real world is rather far from the initially idealised, nice and promising p == 1.0, ( 1 - p ) == 0.0 and o == 0.0.

As was obvious from the very beginning, rather try to beat the ~ 22.1 [us] [SERIAL] threshold than to get worse and worse by going [PARALLEL], where realistic overheads and scaling, on top of an already under-performing approach, do not help a single bit.
