[Image 1]
Today we continue with the Parallel Programming series about the OpenMP API. In the previous articles I gave an Introduction to OpenMP and also covered how we define Parallel Regions. Today we will get into how we define Parallel For Loops.
So, without further ado, let's get straight into it!OpenMP is a parallel programming API thats meant for use in multi-threaded, shared-memory parallelism. Its built up of Compiler directives, Runtime library routines and Environmental variables.
The definition of a compiler directive in the C/C++ implementation of OpenMP is as follows:#pragma omp directive-name [clause ...]
OpenMP programs can be compiled using the GNU compiler by adding the "-fopenmp" flag:gcc -o output-name file-name.c -fopenmp
g++ -o output-name file-name.cpp -fopenmp
A parallel region is defined using a parallel region construct:
#pragma omp parallel
{
/* This code runs in parallel */
}
and can further be configured using clauses:
In the previous article we learned how to define parallel regions. While inside of such a region we can add additional directives like a for loop construct to turn a normal for loop into a parallel for. Of course not all loops can be parallelized that easily and data dependency along the various iterations is quite important to take care of. Furthermore, never forget that the portion of the iterations that each thread of the current team handles may cause the iterations to be executed in an arbitrary order.
Using the for loop construct while inside of a parallel region, we basically specify that the iterations of the associated loop (or loops) will be executed in parallel by the threads in the team that has already been created.
The defition of a for loop can be as simple as:
#pragma omp for
for(...){
...
}
When defined outside a parallel region the loop executes in serial on a single thread. Else, each thread is assigned to different sections of the loop.
To define a parallel for loop directly we can use the following syntax:
#pragma omp parallel for
for(...){
...
}
Let's also note that even though the newest C versions allow the loop variable (mostly int i) to be declared inside the loop body, because we have to explicitly set the loop variable as private, we can't use that syntax with OpenMP.
Therefore, a parallel for loop has to be declared in the following generic way:
int i;
#pragma omp parallel for private(i)
for(i = 0; i < N; i++){
...
}
Of course, as with the parallel construct, OpenMP allows us to configure the for loop to our needs using:
Let's start of with scheduling.
With the schedule clause we define how the iterations should be divided among the threads.
The format of a schedule clause is:
schedule(kind [, chunk])
where:
The kind (or type) of scheduling can be defined as:
To understand how parallel for loop definitions work, and how we can use some of the clauses to configure them, let's get into some examples...
To learn to fly you first have to learn how to walk, and how to make a simple parallel for loop - very important step(!). So, let's first make a simple for loop that calculates the dot product of two matrices.
Let's:
#define directives)The Code for that is:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define PARALLELISM_ENABLED 1
#define M 800
#define N 800
#define P 800
int main(){
/* declare integer arrays */
int A[M][N], B[N][P], C[M][P];
/* loop counters */
int i, j, k;
/* Calculation time */
time_t start_t, end_t;
double total;
/* initialize arrays */
srand(time(NULL));
for(i = 0; i < M; i++){
for(j = 0; j < N; j++){
/* random value in range (0, 100] */
A[i][j] = 1 + rand() % 100;
}
}
for(i = 0; i < N; i++){
for(j = 0; j < P; j++){
/* random value in range (0, 100] */
B[i][j] = 1 + rand() % 100;
}
}
for(i = 0; i < M; i++){
for(j = 0; j < P; j++){
C[i][j] = 0;
}
}
/* calculate dot product with parallel for loop */
start_t = time(NULL);
#pragma omp parallel for private(i, j, k) if(PARALLELISM_ENABLED)
for(i = 0; i < M; i++){
for(j = 0; j < P; j++){
for(k = 0; k < N; k++){
C[i][j] += A[i][k] * B[k][j];
}
}
}
end_t = time(NULL);
/* Calculate and print Calculation time */
total = difftime(end_t, start_t);
printf("Loop execution took: %.2f seconds (parallelism enabled: %d)\n", total, PARALLELISM_ENABLED);
return 0;
}
Running the code twice and tweaking PARALLELISM_ENABLED we get the following results on my system:Loop execution took: 3.00 seconds (parallelism_enabled: 0)
Loop execution took: 0.00 seconds (parallelism_enabled: 1)
Its worth noting that time(NULL) isn't that accurate, but I had problems getting clock() to work.
Anyway, we conclude that the parallel execution is much faster.
The firstprivate clause specifies that a private variable should be initialized to the value of the shared one (in private the value would be unspecified), whilst the lastprivate clause stores the last iteration's value of the private variable back into the shared one.
Let's declare an integer x and initialize it to 5. Then let's try and modify its value when x is in the private, firstprivate or lastprivate clause's list.#include <omp.h>
#include <stdio.h>
int main(){
int i, x;
/* Tweak shared variable */
x = 5;
printf("before parallelism, x: %d\n\n", x);
/* private x */
printf("private(x):\n");
#pragma omp parallel for private(i, x)
for(i = 0; i < 10; i++){
if(i == 0){
printf("thread %d, before tweak, x: %d\n", omp_get_thread_num(), x);
x = 0;
printf("thread %d, after tweak, x: %d\n\n", omp_get_thread_num(), x);
}
}
printf("after private(x), x: %d\n\n", x);
/* first private x */
printf("firstprivate(x):\n");
#pragma omp parallel for private(i) firstprivate(x)
for(i = 0; i < 10; i++){
if(i == 0){
printf("thread %d, before tweak, x: %d\n", omp_get_thread_num(), x);
x = 0;
printf("thread %d, after tweak, x: %d\n\n", omp_get_thread_num(), x);
}
}
printf("after firstprivate(x), x: %d\n\n", x);
/* last private x */
printf("lastprivate(x):\n");
#pragma omp parallel for private(i) lastprivate(x)
for(i = 0; i < 10; i++){
if(i == 9){
printf("thread %d, before tweak, x: %d\n", omp_get_thread_num(), x);
x = 0;
printf("thread %d, after tweak, x: %d\n\n", omp_get_thread_num(), x);
}
}
printf("after lastprivate(x), x: %d\n", x);
return 0;
}
Running the Program we get the following output:Using the collapse clause we can define how many of the nested for loops should be parallelized. In the case of the matrix dot product calculation we can use up to collapse(3), but a value of 2 gave far better results. Using the schedule clause we can easily change the iteration scheduling. Let's use guided here. There is not much difference in this simple calculation though.
So, the directive for the for loop changed into:#pragma omp parallel for private(i, j, k) if(PARALLELISM_ENABLED) collapse(2) schedule(guided)
The rest stays the same!
The result is also - well - 0.00 seconds again, but feels closer to 0 than 1 second when executing. The previous program more often gave 1.00 in some executions.
Parallelization doesn't always offer such a dramatic improvement as the one we saw with the matrix dot product.
Sometimes you are better of running an algorithm in serial fashion.
There is a whole theory behind how we should write parallel loops to get the best result, and some algorithms might not be parallelizable at all!
I might get into that later on, in even more advanced examples, but remember that just putting a #pragma omp parallel for might even destroy your algorithm.
There could be data dependencies along the various iterations, and critical sections and the need of synchronization, in order to keep the result the same as the serial algorithm.
Atomic operations and critical sections in general will be covered in a follow-up article!

Keep on drifting!