[Image 1]
Hey it's a me again drifter1!
Today we continue with the Parallel Programming series around Nvidia's CUDA API to talk about Memory Management.
So, without further ado, let's get straight into it!
CUDA threads may access data from multiple memory spaces during their execution. More specifically:
The CUDA programming model assumes that the host and device maintain their own separate memory spaces, which are referred as host memory (CPU) and device memory (GPU). Through calls to the CUDA Runtime API its possible to manage the global, constant and texture memory spaces, which are visible to the kernels, and so the operations that will be run on the device/GPU. To get even more specific, the API includes calls for:
In addition to the separate host and device memory spaces, its also possible to define a single unified managed memory. Managed memory is accessible to both the host and device, and visible as a single, coherent memory. Such memory helps elliminate the explicit need of mirror data between the host and device.
Allocating memory on the device is as simple as calling the cudaMalloc() function. The function takes two parameters:
int *A;
cudaMalloc(&A, N * sizeof(int));
After the memory has no use anymore it can be deallocated using cudaFree(), which takes in the pointer to the allocated device memory.
Deallocating the array from the previous example is as simple as:
cudaFree(A);
Using cudaMalloc() we simply allocated memory for the specified size on the device. In order to transfer data from the host to the device, its necessary to also call the cudaMemcpy() function. The function takes in the following parameters:
For example, suppose that an array B has been allocated in some way on the host's memory (variable definition or malloc() call). The code to transfer the array B from the host to the device memory allocated previously is:
cudaMemcpy(A, B, N * sizeof(int), cudaMemcpyHostToDevice);
Let's suppose that a calculation will take place on array A (some kernel call followed by cudaDeviceSynchronize()), and that the results have to be transferred back to the host.
To transfer the array A from the device memory to the host memory, and so array B, we then make the following call:
cudaMemcpy(B, A, N * sizeof(int), cudaMemcpyDeviceToHost);
I tend to put "_gpu" after the name of device/GPU memory pointers to distinguish them easier.
Its also possible to skip some of the previous calls by directly declaring global variables using __device__. Such variables have to be declared at global scope, and not within the body of any function.
The array A, from the previous example, can thus also be allocated using:
__device__ int A[N];
Memory Transfer calls have to still be made though.
In order to use unified managed memory, a variable has to be declared (in addition to __device__) with __managed__. This allows for the absence of any explicit cudaMemcpy() call. Its worth noting that this only works from Compute capability 6.x and further.
For example the arrays A and B from the previous example, can easily be defined as a single memory pointer A:
__device__ __managed_ int A[N];
Managed memory can also be allocated using the API call cudaMallocManaged(), which takes in the same parameters as cudaMalloc(). And so, the array A, from the example throughout the article, can be defined as shared between the host and device using:
int *A;
cudaMallocManaged(&A, sizeof(int) * N);
Of course making all the memory unified can lead to performance issues, and so unified memory is mostly used for results from kernel calculations.
Let's write a simple CUDA program that calculates the addition of two N-length vectors, A and B, and stores it in vector C. One way to implement this is by:
In code, the definition of all these variable is quite similar:
#define N 16
...
int main(){
// host memory
int A[N];
int B[N];
int i;
// device memory
int *A_gpu;
int *B_gpu;
// unified memory
int *C;
...
Only A and B have already been allocated so far.
The device memory can be allocated using cudaMalloc() calls, whilst the unified memory using cudaMallocManaged(), as follows:
cudaMalloc(&A_gpu, sizeof(int) * N);
cudaMalloc(&B_gpu, sizeof(int) * N);
cudaMallocManaged(&C, sizeof(int) * N);
Defining C as a unified managed memory space, the cudaMemcpy() call for device to host transfer is avoided.
Of course A and B could also be defined as shared between host and device.
In this simple example it will not make much of a difference.
The vectors can be easily filled with random values:
srand(time(NULL));
for(i = 0; i < N; i++){
A[i] = rand() % 100;
B[i] = rand() % 100;
}
To then copy these values to the corresponding device memory pointers A_gpu and B_gpu, two separate cudaMemcpy() calls have to be made:
cudaMemcpy(A_gpu, A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(B_gpu, B, N * sizeof(int), cudaMemcpyHostToDevice);
Because the problem is 1D, the threads should be organized into 1D thread blocks.
Let's make the number of blocks a #define NUM_BLOCKS.
The number of threads per block is thus equal to N / NUM_BLOCKS.
Each thread can be specified by a unique blockIdx.x and threadIdx.x pair.
The index of the vectors that each thread is working on is simply:
blockIdx.x * (N / NUM_BLOCKS) + threadIdx.x;
The kernel is defined as follows:
__global__ void A_add_B(int *A, int *B, int *C){
int i = blockIdx.x * (N / NUM_BLOCKS) + threadIdx.x;
C[i] = A[i] + B[i];
}
Therefore, each thread calculates the value of a different index of the vector result.
The thread blocks are 1D with each block having N / NUM_BLOCKS, and so:
dim3 threadDims;
threadDims.x = N / NUM_BLOCKS;
threadDims.y = 1;
threadDims.z = 1;
The kernel will be executed as follows:
// execute kernel
A_add_B<<<NUM_BLOCKS, threadDims>>>(A_gpu, B_gpu, C);
// wait for device to finish
cudaDeviceSynchronize();
The results can be easily print out with a simple for-loop:
for(i = 0; i < N; i++){
printf("%d = %d + %d\n", C[i], A[i], B[i]);
}
After the results are retrieved, all the memory that has been allocated, has to be deallocated as well, which can be done using cudaFree():
cudaFree(A_gpu);
cudaFree(B_gpu);
cudaFree(C);
For N = 32 and NUM_BLOCKS = 8, after compilation, the execution prints out the following results:
From the results we can see that the addition was successful.
And this is actually it for today's post!
Next time we might continue with Atomic Operations and Synchronization...
See ya!Keep on drifting!