[Image 1]
Today we continue with the Parallel Programming series about the OpenMP API. I highly suggest you to go read the previous articles of the series, that you can find by the end of this one. The topic of this article is how we specify that SIMD Instructions should be used using OpenMP!
So, without further ado, let's get straight into it!A parallel region is defined using a parallel region construct:
#pragma omp parallel [clause ...]
{
/* This code runs in parallel */
}
Parallel For Loops are defined by using the following syntax:
int i;
#pragma omp parallel for private(i) [clause ...]
for(i = 0; i < N; i++){
...
}
Non-iterative work-sharing is defined using a sections construct with nested section directives:
#pragma omp sections [clause ...]
{
/* run in parallel by all threads of the team */
#pragma omp section
{
/* run once by one thread */
}
#pragma omp section
{
/* run once by one thread */
}
...
}
When only one section inside a parallel region has to be executed by one thread then we can use the single construct:
#pragma omp single
Critical Sections are specified using the following construct:
#pragma omp critical [[(name)] [ hint (hint-expr)]] [clause ...]
{
/* critical section code */
}
Atomic operations are defined as:
#pragma omp atomic [read | write | update | capture] [hint(hint-expr)] [clause ...]
/* statement expression*/
OpenMP also offers Routines for Lock Management of omp_lock_t general-purpose locks:
#pragma omp task [clause ...]
{
/* code block */
}
Implicit Barriers are placed at the end of each parallel section (when no nowait clause has been specified).#pragma omp barrier
In order to force memory consistency, the flush directive is used:
#pragma omp flush [(list)]
Let's start off with the basics. SIMD means Single Instruction on Multiple Data, and can be thought of as Vector Parallelism. In other words, the processors implement special instructions that perform the same calculation on multiple values at once. This is commonly used in GPUs, but can also speed up calculation on CPUs. Most modern CPUs contain AVX instruction extensions in them!
To understand it better let's get into an example.
Let's say we have two one-dimensional vectors of size N, A and B. Our Processing Unit has N different computing units for a specific operation (a vector processing unit for N-sized vectors). An one-dimensional vector of size N, C, is the result of that operation, for each corresponding Ai and Bi.
So, mathematically each element Ci is calculated as follows:
As an SIMD Instruction, this becomes:
and so all elements, Ci, are calculated in parallel.
If the vector was larger, let's say M, then it would be split into chuncks of size N, so that the operations can be applied in parallel. Each of these vector instructions is padded with No-Op's when no data is available.
So, how do we specify that SIMD Instructions should be used to implement a loop?
In order to indicate to the compiler that a for loop should be transformed into SIMD Instructions, we use the - you guessed it - simd directive:
#pragma omp simd [clause]
/* for loops */
The behavior of the simd directive can be tweaked using the following clauses:
There is also a directive that can be used to specify that a complete function definition (or declaration) should be implemented using SIMD Intructions, so that it can be called from an SIMD loop.
The directive for that purpose is declare simd:
#pragma omp declare simd [clause]
/* function definition or declaration */
This directive can be tweaked using the simdlen, linear and aligned clauses as well as:
Lastly, OpenMP can also combine the for and simd directive together, using the shortcut:
#pragma omp for simd [clause]
/* for loops */
Using this directive we specify a for loop that executes concurrently using SIMD instructions, with those iterations being executed in parallel by threads in a team.
All clauses that are accepted by simd and for directives can be used to tweak the for simd shortcut.
Let's multiply to large vectors element-wise using:
limits.h) each!The total ram usage is therefore about 26 Gigabytes!
Don't worry I have 32 gigs - he said...*Behind the scenes*
Well, that was unexpected - he said...Chrome is a ram-eater though, I was already at about 5 Gigabytes without running the code.
To calculate the execution time, we will use the omp function omp_get_wtime(), which gives more accurate results. :)
The vectors can be defined as pointers to unsigned short and then allocated using calloc:
#define N UINT_MAX
...
unsigned short *A, *B, *C;
/* allocate memory */
A = (unsigned short *) calloc(N, sizeof(unsigned short));
B = (unsigned short *) calloc(N, sizeof(unsigned short));
C = (unsigned short *) calloc(N, sizeof(unsigned short));
...
The vectors A and B are filled with random values of 1 byte (0 to 255), so that the multiplication doesn't cause any overflow (255 * 255 = 65025 < 216 - 1 = 65535):
srand(time(NULL));
for(i = 0; i < N; i++){
A[i] = rand() % 256;
B[i] = rand() % 256;
}
The element-wise multiplication of A[i] with B[i] results in C[i] and so can be described by the following for loop:
for (i = 0; i < N; i++){
C[i] = A[i] * B[i];
}
In order to use simd instructions instead of "generic" instructions, we add an simd directive before the loop:
#pragma omp simd
for (i = 0; i < N; i++){
C[i] = A[i] * B[i];
}
Note that the simd is not marked as parallel and thus the execution is basically still sequential!
This case is also perfect for parallel for loops, and so the loop can be put after a parallel for construct with the loop counter i set as private:
#pragma omp parallel for private(i)
for (i = 0; i < N; i++){
C[i] = A[i] * B[i];
}
Lastly, its also possible to define a parallel for simd construct, which optimizes the loop by executing simd instructions:
#pragma omp parallel for simd private(i)
for (i = 0; i < N; i++){
C[i] = A[i] * B[i];
}
The console outputs:
We conclude that serial processing is by far the slowest. SIMD somewhat speeds up the result, but not by much in the case of this program. And Parallel For or Parallel For SIMD is basically the same, because OpenMP automatically detects that the instructions should be SIMD instructions in this simple case. Either way, combining Parallel For with SIMD Instructions (the parallel for simd construct) is a great way of optimizing our programs, when it can be applied. It might have been some milliseconds in this program, but this could translate into minutes or even hours in complicated algorithms. And also, why should we not use the vector-specific processing units that the CPU offers?
Keep on drifting!