How to estimate the thread context switching overhead?

I doubt you can find this overhead somewhere on the web for any existing platform. There exists just too many different platforms. The overhead depends on two factors:

  • The CPU, as the necessary operations may be easier or harder on different CPU types
  • The system kernel, as different kernels will have to perform different operations on each switch

Other factors include how the switch takes place. A switch can take place when

  1. the thread has used all of its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel that will decide who’s next.

  2. the thread was preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter what thread owns the CPU right now, when the user types something or clicks something, he doesn’t want to wait till the current threads time quantum has been used up completely, he wants to see the system reacting immediately. Thus some systems will make the current thread stop immediately and return control to some other thread with higher priority.

  3. the thread doesn’t need CPU time anymore, because it’s blocking on some operation or just called sleep() (or similar) to stop running.

These 3 scenarios might have different thread switching times in theory. E.g. I’d expect the last one to be slowest, since a call to sleep() means the CPU is given back to the kernel and the kernel needs to setup a wake-up call that will make sure the thread is woken up after about the amount of time it requested to sleep, it then must take the thread out of the scheduling process, and once the thread is woken up, it must add the thread again to the scheduling process. All these steeps will take some amount of time. So the actual sleep-call might be longer than the time it takes to switch to another thread.

I think if you want to know for sure, you must benchmark. The problem is that you usually will have to either put threads to sleep or you must synchronize them using mutexes. Sleeping or Locking/Unlocking mutexes has itself an overhead. This means your benchmark will include these overheads as well. Without having a powerful profiler, it’s hard to later on say how much CPU time was used for the actual switch and how much for the sleep/mutex-call. On the other hand, in a real life scenario, your threads will either sleep or synchronize via locks as well. A benchmark that purely measures the context switch time is a synthetically benchmark as it does not model any real life scenario. Benchmarks are much more “realistic” if they base on real-life scenarios. Of what use is a GPU benchmark that tells me my GPU can in theory handle 2 billion polygons a second, if this result can never be achieved in a real life 3D application? Wouldn’t it be much more interesting to know how many polygons a real life 3D application can have the GPU handle a second?

Unfortunately I know nothing of Windows programming. I could write an application for Windows in Java or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you some source code for POSIX.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

uint32_t COUNTER;
pthread_mutex_t LOCK;
pthread_mutex_t START;
pthread_cond_t CONDITION;

void * threads (
    void * unused
) {
    // Wait till we may fire away
    pthread_mutex_lock(&START);
    pthread_mutex_unlock(&START);

    pthread_mutex_lock(&LOCK);
    // If I'm not the first thread, the other thread is already waiting on
    // the condition, thus Ihave to wake it up first, otherwise we'll deadlock
    if (COUNTER > 0) {
        pthread_cond_signal(&CONDITION);
    }
    for (;;) {
        COUNTER++;
        pthread_cond_wait(&CONDITION, &LOCK);
        // Always wake up the other thread before processing. The other
        // thread will not be able to do anything as long as I don't go
        // back to sleep first.
        pthread_cond_signal(&CONDITION);
    }
    pthread_mutex_unlock(&LOCK); //To unlock
}

int64_t timeInMS ()
{
    struct timeval t;

    gettimeofday(&t, NULL);
    return (
        (int64_t)t.tv_sec * 1000 +
        (int64_t)t.tv_usec / 1000
    );
}


int main (
    int argc,
    char ** argv
) {
    int64_t start;
    pthread_t t1;
    pthread_t t2;
    int64_t myTime;

    pthread_mutex_init(&LOCK, NULL);
    pthread_mutex_init(&START, NULL);   
    pthread_cond_init(&CONDITION, NULL);

    pthread_mutex_lock(&START);
    COUNTER = 0;
    pthread_create(&t1, NULL, threads, NULL);
    pthread_create(&t2, NULL, threads, NULL);
    pthread_detach(t1);
    pthread_detach(t2);
    // Get start time and fire away
    myTime = timeInMS();
    pthread_mutex_unlock(&START);
    // Wait for about a second
    sleep(1);
    // Stop both threads
    pthread_mutex_lock(&LOCK);
    // Find out how much time has really passed. sleep won't guarantee me that
    // I sleep exactly one second, I might sleep longer since even after being
    // woken up, it can take some time before I gain back CPU time. Further
    // some more time might have passed before I obtained the lock!
    myTime = timeInMS() - myTime;
    // Correct the number of thread switches accordingly
    COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);
    printf("Number of thread switches in about one second was %u\n", COUNTER);
    return 0;
}

Output

Number of thread switches in about one second was 108406

Over 100’000 is not too bad and that even though we have locking and conditional waits. I’d guess without all this stuff at least twice as many thread switches were possible a second.

Leave a Comment