Lecture 7 - Simple parallel algorithms

Data-parallel vs task-parallel:

The most general approach: consider the dependency graph between computed quantities. It is a DAG (directed acyclic graph). Notes:

Simple data decomposition

Processing of an array of data can often be split into independent blocks.

The easiest case is when each input produces one output — the map pattern. See the example for computing the sum of two vectors: vector_sum_split_work.cpp.

A simple way of computing the block boundaries for the thread with index threadIdx (beginIdx is inclusive, endIdx is exclusive):

  beginIdx = (threadIdx * nrElements) / nrThreads
  endIdx = ((threadIdx + 1) * nrElements) / nrThreads
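To make the formula concrete, here is a minimal sketch of the map pattern with one block per thread. It is not the course file vector_sum_split_work.cpp; the function name vectorSum and the use of std::thread are assumptions made for illustration.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: element-wise sum of a and b using nrThreads threads.
// Assumes a.size() == b.size() and nrThreads >= 1.
std::vector<int> vectorSum(const std::vector<int>& a,
                           const std::vector<int>& b,
                           std::size_t nrThreads) {
    const std::size_t nrElements = a.size();
    std::vector<int> c(nrElements);
    std::vector<std::thread> threads;
    threads.reserve(nrThreads);
    for (std::size_t threadIdx = 0; threadIdx < nrThreads; ++threadIdx) {
        // Boundary formula from the text: blocks are contiguous, cover
        // [0, nrElements) exactly once, and differ in size by at most one.
        const std::size_t beginIdx = (threadIdx * nrElements) / nrThreads;
        const std::size_t endIdx = ((threadIdx + 1) * nrElements) / nrThreads;
        threads.emplace_back([&a, &b, &c, beginIdx, endIdx]() {
            for (std::size_t i = beginIdx; i < endIdx; ++i) {
                c[i] = a[i] + b[i];  // each output is written by exactly one thread
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return c;
}
```

With 5 elements and 3 threads the blocks are [0,1), [1,3) and [3,5).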

However, beware of cache effects! Processing consecutive elements is significantly faster than processing every k-th element. Compare the previous program with the one at vector_sum_split_work_bad.cpp.
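For contrast, a sketch of the strided (cyclic) distribution, where thread t handles elements t, t + nrThreads, t + 2*nrThreads, and so on. This is only an illustration of the access pattern, not the course file vector_sum_split_work_bad.cpp; the name vectorSumStrided is hypothetical. The result is the same as with contiguous blocks, but each thread skips over memory, and on typical hardware neighbouring outputs written by different threads share a cache line.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical "cache-unfriendly" variant: thread threadIdx processes every
// nrThreads-th element. Assumes a.size() == b.size() and nrThreads >= 1.
std::vector<int> vectorSumStrided(const std::vector<int>& a,
                                  const std::vector<int>& b,
                                  std::size_t nrThreads) {
    const std::size_t nrElements = a.size();
    std::vector<int> c(nrElements);
    std::vector<std::thread> threads;
    threads.reserve(nrThreads);
    for (std::size_t threadIdx = 0; threadIdx < nrThreads; ++threadIdx) {
        threads.emplace_back([&a, &b, &c, threadIdx, nrThreads, nrElements]() {
            // Stride of nrThreads: consecutive iterations touch distant elements.
            for (std::size_t i = threadIdx; i < nrElements; i += nrThreads) {
                c[i] = a[i] + b[i];
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return c;
}
```

Timing this against a block-partitioned version on a large vector is one way to observe the difference; the actual ratio depends on the machine.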

A more complex case arises when each output depends on a group of inputs, around the input at the same position — the stencil pattern. See vector_average_stencil.cpp.

It is preferable to split on outputs rather than on inputs, so that each output is computed by exactly one worker (thread, task) and no mutexes are needed.
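A minimal sketch of a stencil split on outputs. It is not the course file vector_average_stencil.cpp; the name vectorAverage and the window of three elements (the element and its two neighbours, truncated at the ends) are assumptions. Threads read overlapping input ranges, which is safe, but each output element has a single writer.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stencil: out[i] is the average of in[i-1], in[i], in[i+1],
// using only the neighbours that exist at the two ends of the vector.
// Assumes nrThreads >= 1.
std::vector<double> vectorAverage(const std::vector<double>& in,
                                  std::size_t nrThreads) {
    const std::size_t n = in.size();
    std::vector<double> out(n);
    std::vector<std::thread> threads;
    threads.reserve(nrThreads);
    for (std::size_t threadIdx = 0; threadIdx < nrThreads; ++threadIdx) {
        // The split is on OUTPUT indices.
        const std::size_t beginIdx = (threadIdx * n) / nrThreads;
        const std::size_t endIdx = ((threadIdx + 1) * n) / nrThreads;
        threads.emplace_back([&in, &out, n, beginIdx, endIdx]() {
            for (std::size_t i = beginIdx; i < endIdx; ++i) {
                const std::size_t lo = (i == 0) ? 0 : i - 1;
                const std::size_t hi = (i + 1 < n) ? i + 1 : i;
                double sum = 0.0;
                for (std::size_t j = lo; j <= hi; ++j) {
                    sum += in[j];  // inputs may also be read by a neighbouring thread
                }
                out[i] = sum / static_cast<double>(hi - lo + 1);
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return out;
}
```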

Recursive decomposition

The initial worker splits the data into two or more fragments, gives the fragments as inputs to subordinate workers, and finally it combines the results.

Example 1: Compute the sum of a vector. Create a binary tree of adders. The depth of the tree is O(log n). Source code: recursive_decomposition_sum.cpp.
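A minimal sketch of the binary tree of adders, using std::async for the subordinate worker. It is not the course file recursive_decomposition_sum.cpp; the name recursiveSum and the cutoff parameter (below which the range is summed sequentially, to bound the number of tasks) are assumptions.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: sums v[begin, end) by recursive decomposition.
// Assumes begin <= end <= v.size() and cutoff >= 1.
long long recursiveSum(const std::vector<int>& v,
                       std::size_t begin, std::size_t end,
                       std::size_t cutoff) {
    if (end - begin <= cutoff) {
        // Small fragment: sum it directly in the current worker.
        long long s = 0;
        for (std::size_t i = begin; i < end; ++i) {
            s += v[i];
        }
        return s;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    // The left half goes to a subordinate worker...
    std::future<long long> left = std::async(
        std::launch::async,
        [&v, begin, mid, cutoff]() { return recursiveSum(v, begin, mid, cutoff); });
    // ...while the current worker handles the right half itself.
    const long long right = recursiveSum(v, mid, end, cutoff);
    // Combine the two partial results.
    return left.get() + right;
}
```

A call such as recursiveSum(v, 0, v.size(), 1000) creates roughly v.size()/1000 leaves, so the cutoff trades tree depth against task-creation overhead.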

Example 2: Merge sort. The basic (non-parallel) algorithm divides the input vector into two parts, merge-sorts each part, then merges the two resulting sorted vectors into one. To parallelize it, the two parts can easily be merge-sorted in parallel. However, the final merge is a bit harder. It can be done as follows:

See the C++ implementations:
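Independently of the course implementations, here is a minimal sketch of the easy part only: the two halves are sorted in parallel, while the final merge is left sequential (std::inplace_merge). The name parallelMergeSort and the depth parameter, which bounds the number of nested parallel splits, are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: sorts v[begin, end). Up to `depth` levels of the
// recursion run their two halves in parallel; deeper levels are sequential.
void parallelMergeSort(std::vector<int>& v,
                       std::size_t begin, std::size_t end, int depth) {
    if (end - begin < 2) {
        return;  // zero or one element: already sorted
    }
    const std::size_t mid = begin + (end - begin) / 2;
    if (depth > 0) {
        // The halves are disjoint ranges, so they can be sorted concurrently.
        std::future<void> left = std::async(
            std::launch::async,
            [&v, begin, mid, depth]() { parallelMergeSort(v, begin, mid, depth - 1); });
        parallelMergeSort(v, mid, end, depth - 1);
        left.get();
    } else {
        parallelMergeSort(v, begin, mid, 0);
        parallelMergeSort(v, mid, end, 0);
    }
    // Sequential merge of the two sorted halves (NOT parallelized here).
    const auto first = v.begin();
    std::inplace_merge(first + static_cast<std::ptrdiff_t>(begin),
                       first + static_cast<std::ptrdiff_t>(mid),
                       first + static_cast<std::ptrdiff_t>(end));
}
```

Because the top-level merge is sequential and linear in n, this version cannot do better than O(n) time no matter how many threads are available, which is why a parallel merge is worth the extra effort.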

Example 3: Compute the sequence of sums of prefixes. Given a_0, a_1, ..., a_{n-1}, compute b_0 = a_0, b_1 = a_0 + a_1, b_2 = a_0 + a_1 + a_2, ..., b_{n-1} = a_0 + a_1 + a_2 + ... + a_{n-1}.

Solution: start with a binary tree computing the sum of all numbers in the sequence. Then, compute each prefix sum from the largest parts already computed.

  // First, compute the sums of 2^j consecutive numbers:
  // b[i*2^j - 1] = a[(i-1)*2^j] + ... + a[i*2^j - 1]
  b = a;            // b starts as a copy of a
  size_t k = 1;     // declared outside the loop: the second phase reuses it
  for( ; k<n ; k = k*2) {
      for(size_t i=2*k-1 ; i<n ; i+=2*k) { // in parallel
          b[i] += b[i-k];
      }
  }
  // Here k is the smallest power of 2 that is >= n.
  // Then, compute each partial sum as a sum of 2^j groups:
  k = k/4;
  for( ; k>0 ; k = k/2) {
      for(size_t i=3*k-1 ; i<n ; i+=2*k) { // in parallel
          b[i] += b[i-k];
      }
  }
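The two phases can be wrapped into a self-contained function for experimenting. This is a sketch: the name prefixSums and the element type long long are assumptions, and the inner loops run sequentially here, although their iterations are independent of each other and could be distributed among threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns b with b[i] = a[0] + ... + a[i].
std::vector<long long> prefixSums(const std::vector<long long>& a) {
    const std::size_t n = a.size();
    std::vector<long long> b = a;
    // Phase 1: binary tree of sums over blocks of 2, 4, 8, ... elements.
    std::size_t k = 1;
    for (; k < n; k *= 2) {
        for (std::size_t i = 2 * k - 1; i < n; i += 2 * k) {  // independent iterations
            b[i] += b[i - k];
        }
    }
    // Now k is the smallest power of 2 that is >= n.
    // Phase 2: complete the remaining positions from the block sums.
    for (k /= 4; k > 0; k /= 2) {
        for (std::size_t i = 3 * k - 1; i < n; i += 2 * k) {  // independent iterations
            b[i] += b[i - k];
        }
    }
    return b;
}
```

For the input 3, 1, 4, 1, 5, 9 the first phase leaves b = 3, 4, 4, 9, 5, 14, and the second phase turns it into 3, 4, 8, 9, 14, 23.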

Examples:

This mechanism works with any associative operation in place of addition.

For example, it can be used for speeding up adding two binary numbers (in hardware). For each bit position, we produce a value that reflects what the carry does over that bit:
  Abbrev  What       Possible input bits
  S       Sink       0+0
  P       Propagate  0+1 or 1+0
  G       Generate   1+1

A block of consecutive bits also generates, propagates, or sinks the carry. The behavior of a block is the result of the following compose operation (rows give the behavior of the right, less significant, part and columns give the behavior of the left, more significant, part):

  right \ left  S  P  G
  S             S  S  G
  P             S  P  G
  G             S  G  G

It can be checked that the operation above is associative. So, the mechanism described above can be used to compute the behavior of the block of the rightmost k bits, for all values of k from 1 to the number of bits of the number.
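A software sketch of this idea for 32-bit numbers. All names (Carry, classify, compose, addWithCarryScan) are hypothetical, and the loops run sequentially here; in hardware each level of the scan would be a layer of gates working in parallel. With no carry entering the lowest bit, the carry leaving the block of bits 0..i is 1 exactly when that block's behavior is G.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Carry { S, P, G };  // Sink, Propagate, Generate

// Behavior of a single bit position, given the two input bits.
Carry classify(bool x, bool y) {
    if (x && y) return Carry::G;
    if (x || y) return Carry::P;
    return Carry::S;
}

// Compose operation from the table: `right` is the less significant part,
// `left` the more significant one. A propagating left part passes on whatever
// the right part does; otherwise the left part decides.
Carry compose(Carry right, Carry left) {
    return left == Carry::P ? right : left;
}

// Adds x and y; the result has 33 significant bits.
std::uint64_t addWithCarryScan(std::uint32_t x, std::uint32_t y) {
    const std::size_t n = 32;
    std::vector<Carry> b(n);
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = classify((x >> i) & 1u, (y >> i) & 1u);
    }
    // Two-phase prefix computation with compose in place of addition.
    std::size_t k = 1;
    for (; k < n; k *= 2) {
        for (std::size_t i = 2 * k - 1; i < n; i += 2 * k) {
            b[i] = compose(b[i - k], b[i]);
        }
    }
    for (k /= 4; k > 0; k /= 2) {
        for (std::size_t i = 3 * k - 1; i < n; i += 2 * k) {
            b[i] = compose(b[i - k], b[i]);
        }
    }
    // b[i] now describes the block of bits 0..i.
    std::uint64_t sum = 0;
    bool carryIn = false;  // carry into bit i
    for (std::size_t i = 0; i < n; ++i) {
        const bool xi = (x >> i) & 1u;
        const bool yi = (y >> i) & 1u;
        if ((xi != yi) != carryIn) {
            sum |= (std::uint64_t{1} << i);
        }
        carryIn = (b[i] == Carry::G);  // carry out of bits 0..i
    }
    if (carryIn) {
        sum |= (std::uint64_t{1} << n);
    }
    return sum;
}
```

Once the prefix behaviors are known, every sum bit depends only on its own two input bits and one precomputed carry, so the last loop is also a map and has no sequential dependency.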

Radu-Lucian LUPŞA
2025-11-06