Parallel Programming in C for the Transputer
© D. Thiébaut, 1995
10
Load Balancing
While scheduling is involved with assigning tasks to processors, load balancing is the process by which elements of the data domain are assigned to processors, with the same two goals of maximizing the processors' utilization, and minimizing the total execution time. Just as scheduling can be done statically or dynamically, load-balancing can theoretically be accomplished statistically, that is before the code runs, or dynamically, while the application is running. In Chapter 9 we covered static data allocations: domain decomposition, functional decomposition and indexing. Load-balancing refers most often, though, to the dynamic distribution of data among the processors, and it is in this context that we will study it here.
Imagine the following situation: we have an application for which the data
domain is composed of N units, and we want to have it solved on a
network of P processors. In addition, the data domain offers great
flexibility in the way it can be decomposed. (The Mandelbrot set problem would
be a good example of such a domain with the screen pixels offering may avenues
for partitioning them into groups: horizontal lines, vertical lines, horizontal
intervals, vertical intervals, or rectangular regions.) Assuming that
P divides N evenly, each processor can start with an equal
number of data units, as shown in Figure 10-1. Let's assume, furthermore,
that the processors are arranged in a chain, with P1 serving as a host
interface. Assuming that each processor is sending intermediary results to the
host as soon as it obtains them, we have a situation where P1 must spend
a large amount of time shuffling data from the other processors in the chain to
the host. This involvement in data communication affects all the processors,
and lessens as we move closer to PN. Hence, P1 will spend less
time processing its own data than P2 does, which it turns, will spend
less time than P3, and so on, until we reach PN which can devote
all of its time computing and processing its own data. As a result PN
will probably finish first[1] and run out of
data before the others. We can then expect PN-1 to be the next to
finish, and so on.
Distributing the work evenly among processors does not yield the best
performance, but creates uneven processor utilization and a longer execution
time than we had hoped for. Essentially the execution time of the application
is the time required by P1 to finish. This is illustrated in Figure
10-2.
Load balancing is a way to keep processor utilization as even as possible.
Static load balancing would require us to estimate, either through
profiling the program or via modeling, the utilization experienced by
each processor in the network, and to assign loads inversely proportional to
it, so that all processors would be guaranteed to finish at approximately the
same time. This approach yields good performance, but unfortunately, profiling
is not always easily accomplished, and modeling requires an intimate knowledge
of the control structure of the application, of the properties of the data
domain, and of the hardware (performance of the communication system, for
example).
Dynamic load balancing, although not as efficient as static load balancing, has
the advantage of yielding competitive performance by providing a means for the
processors to share their loads. Among the many different implementations of
load balancing, two stand out: Centralized and distributed. The
Manager-Workers method is a typical centralized load-balancing scheme
used often with message passing multiprocessors. Full-exchange balance,
on the other hand, represent one of many heuristics for distributed balancing
that have been implemented [RANK88]. Let's get back to our example and study
these different approaches separately.

Figure
10-
1:
Decomposition of data domain into equal size data sets for each
processor.
Figure
10-
2:
The uniform distribution of the "load" to the processors creates a situation
where the processors closer to the Host must spend a disproportionate amount of
time in communication, which prevents them from keeping up with processors away
from the host. At t1 all the processors have equal workloads of six units. At
t2, PN has already diminished its load to 3, while P1 still hasn't made
a dent in his. At T3 PN has finished, while P1, P2,
P3 still have a substantial work left.
In the Manager-Workers implementation, one processor[2], has the special duty of maintaining the
data domain and to distribute it to the other processors as they request more
data. This processor plays the role of a manager supervising the workers and
providing them with work. Going back to our original example, a system
balanced by a Manager-Workers approach would look like that shown in Figure
10-3.

Figure
10-
3:
Manager-Workers example. (a) First the Manager is given the whole data
domain, and Workers send requests for work. (b) The Manager responds by
distributing units of work to all Workers. (c) When a Worker finishes
processing its unit, it sends another request to the Manager which responds by
sending more work (d).
When the network of processors starts, one processor is designated as
the Manager, and obtains the data domain. Because the domain often comes from
the Host, or must be eventually collected and sent back to the host, the
Manager is often the Root in a transputer network. The workers, as soon as
they are initialized send requests to the Manager for work. This one responds
by sending work units back to each worker. As time progresses, some workers
will finish earlier than others and will request more work. By distributing
the load evenly over the network as the computation progresses, the Manager
ensures that all the processors maintain a high level of utilization.
This scenario is a general description of the typical relationship between a
manager and its workers. Depending on the application at hand, different
variants may be introduced.
In the Mandelbrot set application, for example, the Manager could send a number
to a worker identifying which line or column of pixels of the video screen the
worker must generate. Instead of sending a request, the worker could send back
the set of pixels and their associated colors. Another option would be for
the Manager to decompose the domain into small rectangles and, assuming that
their height and width is constant throughout the computation, the Manager
could simply send to the workers the coordinates of the upper left corner of
the rectangles. Because the computation of the color of a pixel takes a
different amount of time depending on its color, different areas of the screen
will require different amount of computing time. Load balancing ensures that
most processors given areas of "fast" colors will be given additional regions
to work on while other processors are stuck with regions of "slow" colors.
The process we have described can be fairly inefficient if implemented
directly. The reason is that it implicitly introduces latencies for the
workers. To see this, let's follow the steps taken by a manager and one of its
workers when this one requests a new unit of work.
| Manager | Worker |
| | |
- Get request
- Send unit of work
| |
| | - Receive unit of work
- Process unit
of work
- Send new request...
|
Table
10-1: Step-by-step description of data exchange and computation for a
Manager-Workers system with no buffering.
The shaded areas above represent idle times for both the worker and the
Manager. We refer to the waiting periods inflicted on the worker which must be
idle while messages are exchanged as latencies. One of the main goals of the
parallel processing on a message passing system is to hide latencies. A
simple solution here is to use double-buffering for the worker. The
idea is that the worker should have two buffers in which it keeps its
work-units. When it is done with a unit, it can overlap the request for a new
one by starting processing the second unit in the buffer.
The step-by-step decomposition of the same scenario as above but with double
buffering is shown below.
| Manager | Double-Buffer Queue | Worker |
| |  | |
- Get request
- Send unit of work
| | |
| |  | |
- Get request
- Send unit of work
|  | - Dequeue unit of work and start working on it
|
| |  | - (Processing unit of work )
|
|
| |  | - Done with unit of work
- Send request for new unit
|
Table
10-2: Step-by-step description of data exchange and computation for a
Manager-Workers system implementing double buffering.
With double buffering, the worker overlaps the transmission of the
request for data and the reception of the next data unit with the processing of
the current data item. Depending on the degree with which the overlap between
computation and communication can be accomplish, the performance can be greatly
affected. For example, if the time required to process the data is much longer
than communication time, then we will be able to observe almost full overlap of
communication and processing.
Additional variations of the Manager-Workers scheme presented here are
feasible. It is possible, for example, to have the Manager initially
distribute the load evenly to all the workers, and to poll them regularly to
check on their progress. Workers that progress slowly can have their load
lightened, while those advancing faster are given more work by the Manager. At
the end of this chapter we will look at such a variant for the N-queen
problem, where the Manager constantly shifts work from the most heavily loaded
worker to the least loaded one.
The Manager-Workers method can also be used for purposes other than load
balancing. In the problem-heap paradigm, for example [MØLL87,
GRØN91], the whole data domain can be decomposed into smaller problems,
which, if they are simple enough, can be solved directly, or can be themselves
decomposed into still smaller problems. Initially the Manager starts with a
heap containing the whole problem. Depending on the implementation, the
Manager can divide the heap into smaller problems that it starts feeding to the
workers, or it can ask one of them to do the original break-down. Once a
worker receives a problem, it either solves it and returns the result to the
Manager, or it divides it into smaller sub problems. In this later case, it
returns all of the sub problems back to the Manager (to the heap), except one
that it keeps working on, recursively. This approach is ideally suited for
divide-and-conquer algorithms implemented on small size multiprocessor
networks. When the size of the network of multiprocessors becomes large,
however, the frequent accesses to the heap result in heavy traffic, hot spots,
and congestion. The solution in this case is to resort to a distributed system
where the heap is divided into sub-heaps kept in the local memory of the
multiprocessors. Because processors may reduce their heap at different rates,
we are faced with the original problem we set out to investigate: that of
balancing the load (heap) among the processors.
EXERCISES
| |
10-1
|
Computing . The code below is a serial program for computing .
Transform it into a parallel implementation based on the Manager-Workers
paradigm. The Manager will divide the interval [0,1] into N smaller
intervals, and will distribute them in the form of triplets of numbers
(S,n,) to the workers. Upon receiving a triplet, a workers will
compute the integral of 
over the interval [S,S+n] by computing
.
Once the sum is computed, the worker will send the result back to the Manager,
which will send a new triplet if more are available. Implement a single buffer
scheme first, and measure the execution time of your program, along with the
utilization of the transputers[3] for
various values of N.
Implement a double-buffer version of the program and compare the execution
times and utilization graphs of the two implementations. Explain the behavior
of the performance curves you measured.
/* =======================================================================
pi.c
Computation of pi on a serial processor.
Syntax: pi low high NoPoints
where low and high define the interval over which the
summation must be performed (normally 0 to 1), and
NoPoints represents the number of terms in the sum.
Example: C:> pi 0 1 100
with 100 points, Pi = 3.141600986923122730
====================================================================== */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define f(x) (4.0/(1.0+(x)*(x)))
main(int argc, char *argv[])
{
double low, high; /* interval bounds */
long NoPoints; /* Number of terms in the sum (or slices) */
double SliceSize; /* size of a slice: (high-low)/NoPoints */
double x; /* running variable */
double PartialSum=0; /* running sum */
/*--- get parameters ---*/
low = atof(argv[1]);
high = atof(argv[2]);
NoPoints = atol(argv[3]);
SliceSize = (high-low)/NoPoints;
/*--- add up terms ---*/
for (x=low; x<high-(SliceSize*0.5); x += SliceSize)
PartialSum += f(x+ SliceSize/2.0)*SliceSize;
/*--- output result ---*/
printf("with %ld points, Pi = %20.18lf ",NoPoints, PartialSum);
}
. |
| |
10-2
|
Is Conway's game of life an attractive application for a
Manager-Workers implementation? Why?
|
| |
10-3
|
Implement a Manager-Workers implementation of the Mandelbrot set.
Divide the display area into small square areas defined by the coordinates of
their upper left corner, and by the length of their side:
(x1,y1,l). These will become the work units distributed
by the Manager to its workers. Workers will return an integer array
corresponding to the color of the pictures in a square.
Do not display the pixels on the screen as they are generated, but instead make
the Manager build the screen internally[4].
Measure the execution time and the processors' utilization (see Exercise 6-2)
when single-buffering and double-buffering are used, as a function of the size
l of the pixel squares.
|