Programming - CUDA Memory Management

Introduction

Hey it's a me again drifter1!

Today we continue with the Parallel Programming series around Nvidia's CUDA API to talk about Memory Management.

So, without further ado, let's get straight into it!

GitHub Repository

https://github.com/drifter1/cudatutorials

Requirements - Prerequisites

Knowledge of the Programming Language C, or even C++
Familiarity with Parallel Computing/Programming in general
CUDA-Capable Nvidia GPU (compute capability should not matter that much)
CUDA Toolkit installed
Previous Articles of the Series

Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution. More specifically:

Each thread has its own private local memory
Each thread block has shared memory that's visible to all threads of the same block
All threads can access the global memory of the GPU

Nvidia's GPUs also have additional read-only (for the GPU) global memory spaces, like the constant and texture memory spaces. Each is optimized for different use-cases, as their name implies.

Host and Device Memory

The CUDA programming model assumes that the host and device maintain their own separate memory spaces, which are referred as host memory (CPU) and device memory (GPU). Through calls to the CUDA Runtime API its possible to manage the global, constant and texture memory spaces, which are visible to the kernels, and so the operations that will be run on the device/GPU. To get even more specific, the API includes calls for:

device memory allocation and deallocation
data transfer between host and device memory (bi-directional)

Unified Memory

In addition to the separate host and device memory spaces, its also possible to define a single unified managed memory. Managed memory is accessible to both the host and device, and visible as a single, coherent memory. Such memory helps elliminate the explicit need of mirror data between the host and device.

Memory Management API

Device Memory Allocation/Deallocation

Allocating memory on the device is as simple as calling the cudaMalloc() function. The function takes two parameters:

A pointer to the allocated device memory
The requested allocation size in bytes

For example, allocating an 1D integer array A of N elements is done as follows:

int *A;
cudaMalloc(&A, N * sizeof(int));

After the memory has no use anymore it can be deallocated using cudaFree(), which takes in the pointer to the allocated device memory.

Deallocating the array from the previous example is as simple as:

cudaFree(A);

Data Transfer Between Host and Device

Using cudaMalloc() we simply allocated memory for the specified size on the device. In order to transfer data from the host to the device, its necessary to also call the cudaMemcpy() function. The function takes in the following parameters:

Destination memory address
Source memory address
Size in bytes to copy
Type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDefault)

When using cudaMemcpyDefault the transfer type will be inferred from the pointer values. Copying from the host to the device, the destination will be the allocated pointer for the device, and the source will be a memory space on the host's memory.

For example, suppose that an array B has been allocated in some way on the host's memory (variable definition or malloc() call). The code to transfer the array B from the host to the device memory allocated previously is:

cudaMemcpy(A, B, N * sizeof(int), cudaMemcpyHostToDevice);

Let's suppose that a calculation will take place on array A (some kernel call followed by cudaDeviceSynchronize()), and that the results have to be transferred back to the host. To transfer the array A from the device memory to the host memory, and so array B, we then make the following call:

cudaMemcpy(B, A, N * sizeof(int), cudaMemcpyDeviceToHost);

I tend to put "_gpu" after the name of device/GPU memory pointers to distinguish them easier.

Global Variables

Its also possible to skip some of the previous calls by directly declaring global variables using __device__. Such variables have to be declared at global scope, and not within the body of any function.

The array A, from the previous example, can thus also be allocated using:

__device__ int A[N];

Memory Transfer calls have to still be made though.

Managed Memory

In order to use unified managed memory, a variable has to be declared (in addition to __device__) with __managed__. This allows for the absence of any explicit cudaMemcpy() call. Its worth noting that this only works from Compute capability 6.x and further.

For example the arrays A and B from the previous example, can easily be defined as a single memory pointer A:

__device__ __managed_ int A[N];

Managed memory can also be allocated using the API call cudaMallocManaged(), which takes in the same parameters as cudaMalloc(). And so, the array A, from the example throughout the article, can be defined as shared between the host and device using:

int *A;
cudaMallocManaged(&A, sizeof(int) * N);

Of course making all the memory unified can lead to performance issues, and so unified memory is mostly used for results from kernel calculations.

Example Program

Let's write a simple CUDA program that calculates the addition of two N-length vectors, A and B, and stores it in vector C. One way to implement this is by:

declaring two vectors A and B in the host memory, that can be initialized/randomly filled in the host code
declaring two pointers A_gpu and B_gpu for device memory, to which A and B will be copied to, and that will be fed into the kernel
declaring a unified memory pointer C to which the result of the addition will be written to

Pointer Definitions and Memory Allocation

In code, the definition of all these variable is quite similar:

#define N 16

...

int main(){

    // host memory
    int A[N];
    int B[N];
    int i;
  
    // device memory
    int *A_gpu;
    int *B_gpu;
 
    // unified memory
    int *C;

    ...

Only A and B have already been allocated so far. The device memory can be allocated using cudaMalloc() calls, whilst the unified memory using cudaMallocManaged(), as follows:

cudaMalloc(&A_gpu, sizeof(int) * N);
cudaMalloc(&B_gpu, sizeof(int) * N);
cudaMallocManaged(&C, sizeof(int) * N);

Defining C as a unified managed memory space, the cudaMemcpy() call for device to host transfer is avoided. Of course A and B could also be defined as shared between host and device. In this simple example it will not make much of a difference.

Filling the Vectors and Passing them to the Device

The vectors can be easily filled with random values:

srand(time(NULL));
for(i = 0; i < N; i++){
    A[i] = rand() % 100;
    B[i] = rand() % 100;
}

To then copy these values to the corresponding device memory pointers A_gpu and B_gpu, two separate cudaMemcpy() calls have to be made:

cudaMemcpy(A_gpu, A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(B_gpu, B, N * sizeof(int), cudaMemcpyHostToDevice);

Thread Hiearchy

Because the problem is 1D, the threads should be organized into 1D thread blocks. Let's make the number of blocks a #define NUM_BLOCKS. The number of threads per block is thus equal to N / NUM_BLOCKS. Each thread can be specified by a unique blockIdx.x and threadIdx.x pair. The index of the vectors that each thread is working on is simply:

blockIdx.x * (N / NUM_BLOCKS) + threadIdx.x;

Kernel Definition and Execution

The kernel is defined as follows:

__global__ void A_add_B(int *A, int *B, int *C){
    int i = blockIdx.x * (N / NUM_BLOCKS) + threadIdx.x;
    
    C[i] = A[i] + B[i];
}

Therefore, each thread calculates the value of a different index of the vector result.

The thread blocks are 1D with each block having N / NUM_BLOCKS, and so:

dim3 threadDims;
threadDims.x = N / NUM_BLOCKS;
threadDims.y = 1;
threadDims.z = 1;

The kernel will be executed as follows:

// execute kernel
A_add_B<<<NUM_BLOCKS, threadDims>>>(A_gpu, B_gpu, C);

// wait for device to finish
cudaDeviceSynchronize();

Print results and deallocate memory

The results can be easily print out with a simple for-loop:

for(i = 0; i < N; i++){
    printf("%d = %d + %d\n", C[i], A[i], B[i]);
}

After the results are retrieved, all the memory that has been allocated, has to be deallocated as well, which can be done using cudaFree():

cudaFree(A_gpu);
cudaFree(B_gpu);
cudaFree(C);

Console Output

For N = 32 and NUM_BLOCKS = 8, after compilation, the execution prints out the following results:

From the results we can see that the addition was successful.

RESOURCES:

References

https://docs.nvidia.com/cuda/index.html

Images

Previous articles about the CUDA API

CUDA API Introduction → Installation, GPU Computing, CUDA API
CUDA Thread Hierarchy → Kernels, Grids of Blocks and Threads, Compilation, Example Program