Programming - CUDA API Introduction

Introduction

Hey it's a me again drifter1! Today marks the start of a new Parallel Programming series. After covering the basics of multi-threading and multi-process parallelization in Networking, a little bit of MPI (Message Passing Interface) for Distributed Programming, and the OpenMP (Open Multi-Processing) API for easier multi-threaded, shared-memory parallelism, it's now time to get into more advanced topics. By advanced I of course mean highly multi-threaded processing, which can be achieved by using GPUs, for example.

For GPU Computing, the two best-known APIs are:

Nvidia's CUDA API, which only works with Nvidia Graphics Cards
OpenCL, which is vendor-neutral and works with Graphics Cards from multiple vendors

Using APIs such as CUDA or OpenCL, it's possible to use GPUs for general-purpose parallel computing and programming!

Because the CUDA API is specifically implemented for Nvidia Graphics Cards, it's also much easier to begin with, and thus this series will be about Nvidia's CUDA API!

Last, but not least, this series will be guided by Nvidia's Documentation on CUDA, but also by my own knowledge and skills that I gained from various projects.

So, without further ado, let's dive straight into it!

GitHub Repository

The code of this series will be uploaded to a GitHub Repository which is yet to be created!

Requirements - Prerequisites

Knowledge of the Programming Language C, or even C++
Familiarity with Parallel Computing/Programming in general
CUDA-Capable Nvidia GPU (compute capability should not matter that much)
CUDA Toolkit installed

Installation Guide

The Documentation of the API is fantastic, meaning that all possible installations should be covered.

Example for Pascal Architecture and Ubuntu OS

I personally have a GeForce GTX 1080 Ti, which is based on the Pascal Architecture, and I am using Ubuntu 20.04 LTS as my operating system.

To install the CUDA Toolkit on a GNU/Linux System like Ubuntu, there are basically two choices:

Install from the Package Repository using the Package Manager (apt on Ubuntu)
Manual Runfile Installation

Because the package repository is mostly up-to-date, manual runfile installation only makes sense if the latest features are a must. Note that manual installation also means manual updating!

So, after verifying that the GPU and Operating System are CUDA-Capable by following the Pre-Installation Actions, it's as simple as:

Adding the CUDA repository meta-data (sudo dpkg -i ...)
Installing the CUDA public GPG key (sudo apt-key add ..., sudo apt-key adv ..., etc.)
Updating the Repository cache (sudo apt-get update)
Installing CUDA (sudo apt-get install cuda)
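
Once the installation is done, a quick way to check that everything works is to compile and run a tiny program that asks the CUDA runtime which devices it can see. The following is only a minimal sketch: the file name device_check.cu and the helper names versionMajor and versionMinor are my own choices for illustration, and it assumes that nvcc (the compiler shipped with the CUDA Toolkit) is on the PATH.

```cuda
// device_check.cu (hypothetical file name)
// Build and run, assuming nvcc is on the PATH:
//   nvcc device_check.cu -o device_check && ./device_check
#include <cstdio>
#include <cuda_runtime.h>

// The runtime encodes versions as 1000 * major + 10 * minor (e.g. 11020 for 11.2).
// These two helpers are illustrative, not part of the CUDA API.
int versionMajor(int version) { return version / 1000; }
int versionMinor(int version) { return (version % 100) / 10; }

int main() {
    int runtimeVersion = 0;
    if (cudaRuntimeGetVersion(&runtimeVersion) == cudaSuccess) {
        std::printf("CUDA runtime version: %d.%d\n",
                    versionMajor(runtimeVersion), versionMinor(runtimeVersion));
    }

    // Ask the runtime how many CUDA-capable devices are visible.
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (deviceCount == 0) {
        std::fprintf(stderr, "No CUDA-capable device found\n");
        return 1;
    }

    // Print the name and compute capability of every visible device.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        std::printf("Device %d: %s (compute capability %d.%d)\n",
                    dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

If the program lists your graphics card, the driver and the toolkit are talking to each other. If it reports an error or no device, the Pre-Installation Actions and the driver installation are the first things to re-check.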

GPU Computing

So, why should you care? Why is general-purpose parallel computing using the GPU so popular?

Benefits of GPU Computing

GPUs offer much higher instruction throughput and memory bandwidth than CPUs at a similar price and power envelope
Lots of applications run faster on the GPU than on the CPU
Other devices such as FPGAs are also very energy efficient, but offer much less programming flexibility than GPUs

Why are GPUs so capable?

Well, it's simple: GPUs and CPUs are designed for different purposes:

CPUs excel at executing a sequence of operations as quickly as possible, with a few tens of threads in parallel (high single-thread performance)
GPUs excel at executing thousands of threads in parallel (with slower single-thread performance but higher overall throughput)

[Image 2]

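GPUs are designed for highly parallel computing, so more of the chip is devoted to data processing and computation, rather than data caching and flow control. Memory accesses on a GPU are not faster than on a CPU. Instead, GPUs hide memory access latency by switching to other threads that are ready to compute, rather than relying on large caches and complex flow control.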

Should applications only run on the GPU?

Most applications have to mix parallel and sequential parts, and so CPUs and GPUs are combined in order to maximize the overall performance. If the application benefits from a high degree of parallelism, then the massively parallel nature of the GPU will of course achieve higher performance than CPUs. If the application is mostly sequential, then parallelism can even make things less efficient, which is of course also a problem in CPU multi-processing or multi-threading!
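
One common way to put a number on this is Amdahl's law, which I'm bringing in here as an illustration. The symbols are my own choice: p is the fraction of the runtime that can be parallelized and n is the number of parallel workers. The formula ignores overheads such as copying data to the GPU, so real speedups are lower.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}

% Example with n = 1000 workers:
%   p = 0.95  =>  S = 1 / (0.05 + 0.00095) \approx 19.6
%   p = 0.50  =>  S = 1 / (0.50 + 0.00050) \approx 2.0
```

So even with a thousand threads, a program that is only half parallel can never run more than about twice as fast, and once the overheads are added it may well end up slower.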

CUDA API

So, after this brief Introduction to the world of GPU Computing, let's now head back to CUDA!

The Nvidia CUDA API is a general-purpose parallel computing platform and programming model that uses Nvidia GPUs in order to solve complex computational problems. CUDA comes with a software environment that lets developers use C/C++ as a high-level programming language. CUDA is also supported by other programming languages, APIs and directive-based approaches, which include, but are not limited to, FORTRAN, DirectCompute and OpenACC.

The Ease of Learning

CUDA has a low learning curve for programmers familiar with C/C++, as it's based on three key abstractions:

Hierarchy of thread groups
Shared Memories
Barrier Synchronization

Those three elements are exposed as a minimal set of language extensions, making getting into CUDA quite easy!

Using these abstractions, CUDA provides data and thread parallelism at its core. Solving a problem using the GPU is as simple as partitioning the problem into sub-problems that can be solved independently in parallel by blocks of threads. Each sub-problem is then split further into smaller pieces that can be solved cooperatively in parallel by all threads within the block.
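
To give a first taste of how the three abstractions fit together, here is a small preview sketch that sums an array. Each block handles its own slice of the data independently (thread hierarchy), the threads of a block cooperate through a shared buffer (shared memory), and they wait for each other between steps (barrier synchronization). The names blockSum, sumOnGpu and THREADS_PER_BLOCK are made up for this example, and the block size of 256 is just an assumption that works on common hardware. The details will be explained properly in the following articles.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative block size; must be a power of two for the halving loop below.
constexpr int THREADS_PER_BLOCK = 256;

// Each block sums its own slice of the input and writes one partial sum.
__global__ void blockSum(const float* in, float* blockSums, int n) {
    __shared__ float cache[THREADS_PER_BLOCK];        // shared by the threads of one block

    int global = blockIdx.x * blockDim.x + threadIdx.x;   // position in the whole array
    cache[threadIdx.x] = (global < n) ? in[global] : 0.0f;
    __syncthreads();                                   // barrier: all loads are done

    // Tree reduction: halve the number of active threads in every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();                               // barrier: this step is done
    }

    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = cache[0];
    }
}

// Host helper: copies the data to the GPU, launches the kernel and adds up
// the per-block partial sums on the CPU. Error handling is kept minimal.
float sumOnGpu(const std::vector<float>& data) {
    int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;   // round up

    float* dIn = nullptr;
    float* dSums = nullptr;
    cudaMalloc((void**)&dIn, n * sizeof(float));
    cudaMalloc((void**)&dSums, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS_PER_BLOCK>>>(dIn, dSums, n);
    cudaError_t err = cudaGetLastError();              // catches launch errors
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    }

    std::vector<float> partial(blocks);
    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(partial.data(), dSums, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSums);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main() {
    // Every element is 1, so the sum should equal the element count.
    std::vector<float> data(1000, 1.0f);
    std::printf("sum = %.1f\n", sumOnGpu(data));
    return 0;
}
```

Note that the blocks never talk to each other here, which is exactly what allows the GPU to run them in any order and on any multiprocessor. Combining the partial sums on the CPU is a simplification to keep the sketch short.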

GPU Architecture

GPUs are built around an array of Streaming Multiprocessors (SMs).

[Image 3]

A multi-threaded program is partitioned into blocks of threads that execute independently of each other, and those blocks are distributed across the available SMs. This makes GPUs with more multiprocessors automatically execute programs faster than GPUs with fewer multiprocessors. Similarly, a program needs enough blocks, and enough threads in each block, to keep all of those multiprocessors busy.

Nvidia GPUs have a number of CUDA cores, which roughly indicates how many arithmetic instructions can be executed per clock cycle. How many threads per block, and how many blocks in total, the program should use depends on the application. CUDA has some limits per block, dimension etc. that also depend on the architecture and compute capability. In the end it's mostly trial-and-error with such parameters in order to get the best results. There are of course some guidelines that should always be followed!
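
The limits mentioned above don't have to be looked up in a table, as the runtime can report them for the installed card. The sketch below prints a few of them for device 0 and shows how the number of blocks follows from a chosen block size. The helper blocksNeeded, the problem size of one million elements and the candidate block sizes are assumptions picked purely for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements, rounded up so that the last,
// partially filled block is included. Illustrative helper, not a CUDA API call.
int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 is assumed
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("Warp size:                 %d\n", prop.warpSize);
    std::printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    std::printf("Max block dimensions:      %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("Max grid dimensions:       %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("Shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);

    // Try a few candidate block sizes for a problem of one million elements.
    const int n = 1000000;
    const int candidates[] = {128, 256, 512, 1024};
    for (int threads : candidates) {
        if (threads > prop.maxThreadsPerBlock) continue;    // not allowed on this card
        std::printf("%4d threads per block -> %d blocks\n", threads, blocksNeeded(n, threads));
    }
    return 0;
}
```

Which of these configurations is actually the fastest can only be found by measuring, which is where the trial-and-error part comes in.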

The thread and block hierarchy will be discussed in depth next time, where we will also write our first CUDA program!