Cuda atomics change flag

It looks to me like what you want is a “critical section” in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.

A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.

Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.

You should use the following code in your kernel to control thread access to a critical section:

__syncthreads();
if (threadIdx.x == 0)
  acquire_semaphore(&sem);
__syncthreads();
  //begin critical section
  // ... your critical section code goes here
  //end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
  release_semaphore(&sem);
__syncthreads();

Prior to the kernel define these helper functions and device variable:

__device__ volatile int sem = 0;

__device__ void acquire_semaphore(volatile int *lock){
  while (atomicCAS((int *)lock, 0, 1) != 0);
  }

__device__ void release_semaphore(volatile int *lock){
  *lock = 0;
  __threadfence();
  }

I have tested and used successfully the above code. Note that it essentially arbitrates between threadblocks using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)) your critical section code if you want only one thread in the winning threadblock to execute the critical section code.

Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don’t recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)

Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.

Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.

Leave a Comment