[Image 1]
Hey it's a me again drifter1!
Today we continue with the Parallel Programming series around Nvidia's CUDA API to talk about Atomic Functions and Synchronization.
So, without further ado, let's get straight into it!
Until now in the series, it was only shown how the host (CPU) can wait for the device (GPU) to finish the kernel execution.
This was done using the API call cudaDeviceSynchronize().
But what about the execution on the various threads of the GPU?
Is there any synchronization involved that permits only one thread to access a specific chunk of the shared and global memory?
The simple answer is No.
The CUDA programming model is not specifically defining (on its own) in which order a thread writes data to the various memory spaces. Additionally, any another device thread or even the host, might also not observe the data "changes" in the same way. So, similar to any other parallel programming model, threads have undefined behavior and are on an endless data-race, when no synchronization is specified by the programmer.
In order to enforce order to memory access, one can use the memory fence functions that the CUDA API offers. There are various types of memory fence functions that differ by scope, but still enforce order independent from the accessed memory space:
The CUDA API also provides synchronization of threads in the same block using the __syncthreads() function.
This call makes the calling threads wait until all threads in the same thread block have reached this point.
It also ensures that all global and shared memory accesses made by these threads prior to the call of this function will become visible to all threads in the block.
In other words, using __syncthreads() one can coordinate the communication among threads of the same block, therefore solving potential read-after-write, write-after-read and write-after-write data hazards.
CUDA also includes some variations of __syncthreads():
Because using Memory Fence and Synchronization Functions can become quite tricky, CUDA also offers specific functions that perform common read-modify-write operations on data risiding in global or shared memory. These atomics can be:
CUDA provides the following atomic arithmetic functions:
CUDA also includes atomic functions for bitwise operations:
Consider a very large vector A[N] of randomly generated unsigned 32-bit integers, and so values in the range [0, 232). Let's write an algorithm that finds the maximum value inside of vector A:
Input: A[N]
Output: max
Initialize: max ← A[0]
for i = 1, 2, ..., N - 1 do
if A[i] > max then
max ← A[i]
end if
end for
In non-parallel C without CUDA the algorithm is very easy to implement, but also having the worst performance.
My 1080 Ti has 11GB of DRAM, so let's make the vector of size N = 231, meaning that about 8.59 GB will be needed only for storing the vector A.
The vector size can be easily defined using the limits.h constant UINT_MAX (of value 232 - 1) by dividing it by 2 and adding 1:
#define N (UINT_MAX / 2 + 1)
The vector can be defined as an unsigned integer pointer, allocated using calloc() and then deallocated again using free():
unsigned int *A;
...
A = (unsigned int*) calloc(N, sizeof(unsigned int));
...
free(A);
Filling the vector A with random values is as easy as:
srand(time(NULL));
for(i = 0; i < N; i++){
A[i] = (unsigned int) (rand() % UINT_MAX);
}
Last, but not least, the max value calculation can be done as follows:
unsigned int i;
unsigned int max;
max = A[0];
for(i = 1; i < N; i++){
max = (A[i] > max) ? A[i] : max;
}
printf("Max is %u\n", max);
Compiling the program with gcc and then executing it, the following result is shown in the console:
Let's not calculate the execution time yet!
In order to speed-up the process, its possible to split vector A into parts, compute the maximum value of each part and then find the maximum value of all parts. Or in CUDA terms, we simply assign each thread to a different portion of the vector.
To keep things simple, let's only create a single block with threads organized in 1D. Let's run the kernel for various numbers of threads to compare how parallelization improves the computation!
First off, let's turn the serial calculation into a function:
int find_max_serial(unsigned int *A){
unsigned int i;
unsigned int max;
max = A[0];
for(i = 1; i <N; i++){
max = (A[i] > max) ? A[i] : max;
}
return max;
}
Serial calculation is now as simple as:
max = find_max_serial(A);
Next up, let's allocate memory on the device/GPU for the vector (A_gpu) and result (max_gpu) using cudaMalloc():
unsigned int *A_gpu;
unsigned int *max_gpu;
cudaMalloc(&A_gpu, N * sizeof(unsigned int));
cudaMalloc(&max_gpu, sizeof(unsigned int));
Afterwards the memory can be easily deallocated using cudaFree():
cudaFree(A_gpu);
cudaFree(max_gpu);
The vector A can be easily transferred to the device using cudaMemcpy():
cudaMemcpy(A_gpu, A, N * sizeof(unsigned int), cudaMemcpyHostToDevice);
The result (max_gpu) has to be initialized and transferred to the device, and later on brought back to the host, again using cudaMemcpy():
max = 0;
cudaMemcpy(max_gpu, &max, sizeof(unsigned int), cudaMemcpyHostToDevice);
...
cudaMemcpy(&max, max_gpu, sizeof(unsigned int), cudaMemcpyDeviceToHost);
The kernel function for the device/GPU is declared as:
__global__ void find_max_parallel(unsigned int *A, unsigned int *max_block);
Using the threadIdx.x and blockDim.x variables available to the thread its easy to calculate the range it has to operate on:
unsigned int low_index = threadIdx.x * (N / blockDim.x);
unsigned int high_index = (threadIdx.x + 1) * (N / blockDim.x);
The thread then computes the max value for its range as follows:
max = A[low_index];
for(i = low_index + 1; i < high_index; i++){
max = (A[i] > max) ? A[i] : max;
}
The max value of the block is then changed using atomicMax():
atomicMax(max_block, max);
The thread block dimensions are defined as a dim3 structure:
dim3 threadDims;
threadDims.x = 1;
threadDims.y = 1;
threadDims.z = 1;
The kernel is executed for all multiples of 2 from 2 up to 1024 threads:
for(i = 2; i <= 1024; i*=2){
threadDims.x = i;
max = 0;
cudaMemcpy(max_gpu, &max, sizeof(unsigned int), cudaMemcpyHostToDevice);
find_max_parallel<<<1, threadDims>>>(A_gpu, max_gpu);
cudaDeviceSynchronize();
}
An easy way to calculate the execution time of kernels, and even host functions, is using events!
Events are defined and created as follows:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
Before the function that should be "recorded" cudaEventRecord(start); has to be called.cudaEventRecord(stop);
cudaEventSynchronize(stop);
float time_elapsed = 0;
cudaEventElapsedTime(&time_elapsed, start, stop);
The events are then destroyed using:
cudaEventDestroy(start);
cudaEventDestroy(stop);
Printing out the max value calculated, as well as the execution time, the code prints out:
We conclude, that the parallelization works correctly, because the result is the same as the serial calculation.
Also, parallelization only starts to benefit after using 16 cuda threads, with the optimal being 64 that took about 1.7 seconds.
It's worth noting that the parallelization also starts affecting the algorithm in a negative way when the amount of threads is pretty high.
For example, 256 threads have somewhat the same performance as 4 threads(!).
Perfomance starts to degrade, because the atomicMax() operation has to be executed in a much higher number of threads, turning the parallel cuda-gpus code basically into serial code!
Thus, parallelization should always be used with caution!
And this is actually it for today's post!
I'm not sure about the next article yet, but there is surely a lot more to cover!
See ya!Keep on drifting!