Programming - CUDA Atomic Functions and Synchronization

Introduction

Hey it's a me again drifter1!

Today we continue with the Parallel Programming series around Nvidia's CUDA API to talk about Atomic Functions and Synchronization.

So, without further ado, let's get straight into it!

GitHub Repository

https://github.com/drifter1/cudatutorials

Requirements - Prerequisites

Knowledge of the Programming Language C, or even C++
Familiarity with Parallel Computing/Programming in general
CUDA-Capable Nvidia GPU (compute capability should not matter that much)
CUDA Toolkit installed
Previous Articles of the Series

Synchronization

Until now in the series, it was only shown how the host (CPU) can wait for the device (GPU) to finish the kernel execution. This was done using the API call cudaDeviceSynchronize(). But what about the execution on the various threads of the GPU? Is there any synchronization involved that permits only one thread to access a specific chunk of the shared and global memory? The simple answer is No.

The CUDA programming model is not specifically defining (on its own) in which order a thread writes data to the various memory spaces. Additionally, any another device thread or even the host, might also not observe the data "changes" in the same way. So, similar to any other parallel programming model, threads have undefined behavior and are on an endless data-race, when no synchronization is specified by the programmer.

Memory Fence Functions

In order to enforce order to memory access, one can use the memory fence functions that the CUDA API offers. There are various types of memory fence functions that differ by scope, but still enforce order independent from the accessed memory space:

__threadfence_block()
- Applies to all threads in the current program that execute on the same thread block as the calling thread.
- Enforces that all writes to memory by the calling thread made before the call to __threadfence_block() are observed by all threads executing on the same thread block before all writes to memory made after the call.
- Enforces that all reads to memory by the calling thread made before the call to __threadfence_block() are ordered before all reads to memory after the call.
__threadfence()
- Applies to all threads in the current program that execute on the same compute device as the calling thread.
- Same as __threadfence_block(), but with the changes being observed by all threads executing on the same compute device.
__threadfence_system()
- Applies to all threads in the current program that execute on any device, including other CPUS and GPUs.
- Same as __threadfence_block(), but with the changes being observed by all threads executing on any device, including host threads.

Its worth noting that memory fence functions only affect the order of the memory operations by a thread, but do not ensure that these operations will be visible to other threads. In order to ensure the correct visibility, and that no cached versions of data will be used, the volatile qualifier should be used on the variables that should be synchronized.

Synchronization Functions

The CUDA API also provides synchronization of threads in the same block using the __syncthreads() function. This call makes the calling threads wait until all threads in the same thread block have reached this point. It also ensures that all global and shared memory accesses made by these threads prior to the call of this function will become visible to all threads in the block. In other words, using __syncthreads() one can coordinate the communication among threads of the same block, therefore solving potential read-after-write, write-after-read and write-after-write data hazards.

CUDA also includes some variations of __syncthreads():

__syncthreads_count() - with the additional feature of evaluating the predicate for all threads of the block, and returning the NUMBER of threads for which it evaluates to zero
__syncthreads_and() - -//-, and returning non-zero if and only if the predicate evaluates to non-zero for ALL threads
__syncthreads_or() - -//-, and returning non-zero if and only if the predicate evaluates to non-zero for ANY thread

Atomic Functions

Because using Memory Fence and Synchronization Functions can become quite tricky, CUDA also offers specific functions that perform common read-modify-write operations on data risiding in global or shared memory. These atomics can be:

system-wide with _system suffix, and so affecting all threads executing the current program, including CPUs and GPUs
device-wide with no suffix, and so affecting all threads executing the current program on the same compute device
block-wide with _block suffix, and so affecting all threads in the same thread block

The operation is guaranteed to be performed without any interference from other threads, meaning no other thread access the same address until the operation completes.

Arithmetic Functions

CUDA provides the following atomic arithmetic functions:

atomicAdd(address, val)
- reads the old value located at address
- computes old + val
- stores the result back to the same address
- returns the old value
atomicSub(address, val)
- similar to atomicAdd, but computing old - val
atomicExch(address, val)
- reads the old value located at address
- stores val at the same address
- returns the old value
atomicMin(address, val)
- reads the old value located at address
- computes the minimum of old and val
- stores the result back to the same address
- returns the old value
atomicMax(address, val)
- similar to atomicMin, but computing and returning the maximum
atomicInc(address, val)
- reads the old value located at address
- computes ((old >= val) ? 0 : (old+1))
- stores the result back to the same address
- returns the old value
atomicDec(address, val)
- similar to atomicInc, but computing (((old == 0) || (old > val)) ? val : (old-1) )
atomicCAS(address, compare, val)
- reads the old value located at address
- computes (old == compare ? val : old)
- stores the result back to the same address
- returns the old value

Using atomicCAS() any atomic operation can be implemented.

Bitwise Functions

CUDA also includes atomic functions for bitwise operations:

atomicAnd(address, val)
- reads the old value located at address
- computes old & val
- stores the result back to the same address
- returns the old value
atomicOr(address, val)
- similar to atomicAnd, but computing old | val
atomicXor(address, val)
- similar to atomicAnd, but computing old ^ val

Using combinations of these three functions any other bitwise operation like not, nor or nand can be easily implemented.

Example Programs

Algorithm Description

Consider a very large vector A[N] of randomly generated unsigned 32-bit integers, and so values in the range [0, 2³²). Let's write an algorithm that finds the maximum value inside of vector A:

Input: A[N]
Output: max
Initialize: max ← A[0]
for i = 1, 2, ..., N - 1 do
    if A[i] > max then
        max ← A[i]
    end if
end for

Non-Parallel C Code

In non-parallel C without CUDA the algorithm is very easy to implement, but also having the worst performance. My 1080 Ti has 11GB of DRAM, so let's make the vector of size N = 2³¹, meaning that about 8.59 GB will be needed only for storing the vector A. The vector size can be easily defined using the limits.h constant UINT_MAX (of value 2³² - 1) by dividing it by 2 and adding 1:

#define N (UINT_MAX / 2 + 1)

The vector can be defined as an unsigned integer pointer, allocated using calloc() and then deallocated again using free():

unsigned int *A;

...

A = (unsigned int*) calloc(N, sizeof(unsigned int));

...

free(A);

Filling the vector A with random values is as easy as:

srand(time(NULL));   
for(i = 0; i < N; i++){
    A[i] = (unsigned int) (rand() % UINT_MAX);
}

Last, but not least, the max value calculation can be done as follows:

unsigned int i;
unsigned int max;

max = A[0];
for(i = 1; i < N; i++){
    max = (A[i] > max) ? A[i] : max;
}

printf("Max is %u\n", max);

Compiling the program with gcc and then executing it, the following result is shown in the console:

Let's not calculate the execution time yet!

Cuda Code

In order to speed-up the process, its possible to split vector A into parts, compute the maximum value of each part and then find the maximum value of all parts. Or in CUDA terms, we simply assign each thread to a different portion of the vector.

To keep things simple, let's only create a single block with threads organized in 1D. Let's run the kernel for various numbers of threads to compare how parallelization improves the computation!

Serial Calculation as Function

First off, let's turn the serial calculation into a function:

int find_max_serial(unsigned int *A){
    unsigned int i;
    unsigned int max;

    max = A[0];
    for(i = 1; i <N; i++){
        max = (A[i] > max) ? A[i] : max;
    }
   
    return max;
}

Serial calculation is now as simple as:

max = find_max_serial(A);

Device Memory Allocation/Deallocation

Next up, let's allocate memory on the device/GPU for the vector (A_gpu) and result (max_gpu) using cudaMalloc():

unsigned int *A_gpu;
unsigned int *max_gpu;

cudaMalloc(&A_gpu, N * sizeof(unsigned int));
cudaMalloc(&max_gpu, sizeof(unsigned int));

Afterwards the memory can be easily deallocated using cudaFree():

cudaFree(A_gpu);
cudaFree(max_gpu);

Data Transfer between Host/Device

The vector A can be easily transferred to the device using cudaMemcpy():

cudaMemcpy(A_gpu, A, N * sizeof(unsigned int), cudaMemcpyHostToDevice);

The result (max_gpu) has to be initialized and transferred to the device, and later on brought back to the host, again using cudaMemcpy():

max = 0;
cudaMemcpy(max_gpu, &max, sizeof(unsigned int), cudaMemcpyHostToDevice);

...

cudaMemcpy(&max, max_gpu, sizeof(unsigned int), cudaMemcpyDeviceToHost);

Kernel Function

The kernel function for the device/GPU is declared as:

__global__ void find_max_parallel(unsigned int *A, unsigned int *max_block);

Using the threadIdx.x and blockDim.x variables available to the thread its easy to calculate the range it has to operate on:

unsigned int low_index = threadIdx.x * (N / blockDim.x);
unsigned int high_index = (threadIdx.x + 1) * (N / blockDim.x);

The thread then computes the max value for its range as follows:

max = A[low_index];
for(i = low_index + 1; i < high_index; i++){
    max = (A[i] > max) ? A[i] : max;
}

The max value of the block is then changed using atomicMax():

atomicMax(max_block, max);

The thread block dimensions are defined as a dim3 structure:

dim3 threadDims;
threadDims.x = 1;
threadDims.y = 1;
threadDims.z = 1;

The kernel is executed for all multiples of 2 from 2 up to 1024 threads:

for(i = 2; i <= 1024; i*=2){
    threadDims.x = i;

    max = 0;
    cudaMemcpy(max_gpu, &max, sizeof(unsigned int), cudaMemcpyHostToDevice);

    find_max_parallel<<<1, threadDims>>>(A_gpu, max_gpu);

    cudaDeviceSynchronize();
}

CUDA Events for Timing

An easy way to calculate the execution time of kernels, and even host functions, is using events!

Events are defined and created as follows:

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

Before the function that should be "recorded" cudaEventRecord(start); has to be called.

After the function we write:

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float time_elapsed = 0;
cudaEventElapsedTime(&time_elapsed, start, stop);

The events are then destroyed using:

cudaEventDestroy(start);
cudaEventDestroy(stop);

Console Output

Printing out the max value calculated, as well as the execution time, the code prints out:

We conclude, that the parallelization works correctly, because the result is the same as the serial calculation. Also, parallelization only starts to benefit after using 16 cuda threads, with the optimal being 64 that took about 1.7 seconds. It's worth noting that the parallelization also starts affecting the algorithm in a negative way when the amount of threads is pretty high. For example, 256 threads have somewhat the same performance as 4 threads(!). Perfomance starts to degrade, because the atomicMax() operation has to be executed in a much higher number of threads, turning the parallel cuda-gpus code basically into serial code! Thus, parallelization should always be used with caution!

RESOURCES:

References

https://docs.nvidia.com/cuda/index.html

Images

Previous articles about the CUDA API

CUDA API Introduction → Installation, GPU Computing, CUDA API
CUDA Thread Hierarchy → Kernels, Grids of Blocks and Threads, Compilation, Example Program
CUDA Memory Management → Memory Hierarchy (Host, Device and Unified), Memory Management API (Allocation/Deallocation, Data Transfer), Example Program

Final words | Next up

And this is actually it for today's post!

I'm not sure about the next article yet, but there is surely a lot more to cover!

See ya!

Keep on drifting!