Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



10
Load Balancing

While scheduling is involved with assigning tasks to processors, load balancing is the process by which elements of the data domain are assigned to processors, with the same two goals of maximizing the processors' utilization, and minimizing the total execution time. Just as scheduling can be done statically or dynamically, load-balancing can theoretically be accomplished statistically, that is before the code runs, or dynamically, while the application is running. In Chapter 9 we covered static data allocations: domain decomposition, functional decomposition and indexing. Load-balancing refers most often, though, to the dynamic distribution of data among the processors, and it is in this context that we will study it here.

10-1 Dynamic Load Balancing

An example

Imagine the following situation: we have an application for which the data domain is composed of N units, and we want to have it solved on a network of P processors. In addition, the data domain offers great flexibility in the way it can be decomposed. (The Mandelbrot set problem would be a good example of such a domain with the screen pixels offering may avenues for partitioning them into groups: horizontal lines, vertical lines, horizontal intervals, vertical intervals, or rectangular regions.) Assuming that P divides N evenly, each processor can start with an equal number of data units, as shown in Figure 10-1. Let's assume, furthermore, that the processors are arranged in a chain, with P1 serving as a host interface. Assuming that each processor is sending intermediary results to the host as soon as it obtains them, we have a situation where P1 must spend a large amount of time shuffling data from the other processors in the chain to the host. This involvement in data communication affects all the processors, and lessens as we move closer to PN. Hence, P1 will spend less time processing its own data than P2 does, which it turns, will spend less time than P3, and so on, until we reach PN which can devote all of its time computing and processing its own data. As a result PN will probably finish first[1] and run out of data before the others. We can then expect PN-1 to be the next to finish, and so on.

Distributing the work evenly among processors does not yield the best performance, but creates uneven processor utilization and a longer execution time than we had hoped for. Essentially the execution time of the application is the time required by P1 to finish. This is illustrated in Figure 10-2.

Static load balancing

Load balancing is a way to keep processor utilization as even as possible. Static load balancing would require us to estimate, either through profiling the program or via modeling, the utilization experienced by each processor in the network, and to assign loads inversely proportional to it, so that all processors would be guaranteed to finish at approximately the same time. This approach yields good performance, but unfortunately, profiling is not always easily accomplished, and modeling requires an intimate knowledge of the control structure of the application, of the properties of the data domain, and of the hardware (performance of the communication system, for example).



Centralized and distributed methods

Dynamic load balancing, although not as efficient as static load balancing, has the advantage of yielding competitive performance by providing a means for the processors to share their loads. Among the many different implementations of load balancing, two stand out: Centralized and distributed. The Manager-Workers method is a typical centralized load-balancing scheme used often with message passing multiprocessors. Full-exchange balance, on the other hand, represent one of many heuristics for distributed balancing that have been implemented [RANK88]. Let's get back to our example and study these different approaches separately.


Figure 10- 1: Decomposition of data domain into equal size data sets for each processor.


Figure 10- 2: The uniform distribution of the "load" to the processors creates a situation where the processors closer to the Host must spend a disproportionate amount of time in communication, which prevents them from keeping up with processors away from the host. At t1 all the processors have equal workloads of six units. At t2, PN has already diminished its load to 3, while P
1 still hasn't made a dent in his. At T3 PN has finished, while P1, P2, P3 still have a substantial work left.

Manager-
Workers

In the Manager-Workers implementation, one processor[2], has the special duty of maintaining the data domain and to distribute it to the other processors as they request more data. This processor plays the role of a manager supervising the workers and providing them with work. Going back to our original example, a system balanced by a Manager-Workers approach would look like that shown in Figure 10-3.


Figure 10- 3: Manager-Workers example. (a) First the Manager is given the whole data domain, and Workers send requests for work. (b) The Manager responds by distributing units of work to all Workers. (c) When a Worker finishes processing its unit, it sends another request to the Manager which responds by sending more work (d).

When the network of processors starts, one processor is designated as the Manager, and obtains the data domain. Because the domain often comes from the Host, or must be eventually collected and sent back to the host, the Manager is often the Root in a transputer network. The workers, as soon as they are initialized send requests to the Manager for work. This one responds by sending work units back to each worker. As time progresses, some workers will finish earlier than others and will request more work. By distributing the load evenly over the network as the computation progresses, the Manager ensures that all the processors maintain a high level of utilization.

This scenario is a general description of the typical relationship between a manager and its workers. Depending on the application at hand, different variants may be introduced.

Variants

In the Mandelbrot set application, for example, the Manager could send a number to a worker identifying which line or column of pixels of the video screen the worker must generate. Instead of sending a request, the worker could send back the set of pixels and their associated colors. Another option would be for the Manager to decompose the domain into small rectangles and, assuming that their height and width is constant throughout the computation, the Manager could simply send to the workers the coordinates of the upper left corner of the rectangles. Because the computation of the color of a pixel takes a different amount of time depending on its color, different areas of the screen will require different amount of computing time. Load balancing ensures that most processors given areas of "fast" colors will be given additional regions to work on while other processors are stuck with regions of "slow" colors.

Performance

The process we have described can be fairly inefficient if implemented directly. The reason is that it implicitly introduces latencies for the workers. To see this, let's follow the steps taken by a manager and one of its workers when this one requests a new unit of work.

Manager Worker
 
  • Send request
  • Get request
  • Send unit of work
 
 
  • Receive unit of work
  • Process unit of work
  • Send new request...
Table 10-1: Step-by-step description of data exchange and computation for a Manager-Workers system with no buffering.

The shaded areas above represent idle times for both the worker and the Manager. We refer to the waiting periods inflicted on the worker which must be idle while messages are exchanged as latencies. One of the main goals of the parallel processing on a message passing system is to hide latencies. A simple solution here is to use double-buffering for the worker. The idea is that the worker should have two buffers in which it keeps its work-units. When it is done with a unit, it can overlap the request for a new one by starting processing the second unit in the buffer.

The step-by-step decomposition of the same scenario as above but with double buffering is shown below.

Manager Double-Buffer QueueWorker
 
  • Send request
  • Get request
  • Send unit of work
  
 
  • Send new request
  • Get request
  • Send unit of work
  • Dequeue unit of work and start working on it
 
  • (Processing unit of work )
 
  • Done with unit of work
  • Send request for new unit
Table 10-2: Step-by-step description of data exchange and computation for a Manager-Workers system implementing double buffering.

With double buffering, the worker overlaps the transmission of the request for data and the reception of the next data unit with the processing of the current data item. Depending on the degree with which the overlap between computation and communication can be accomplish, the performance can be greatly affected. For example, if the time required to process the data is much longer than communication time, then we will be able to observe almost full overlap of communication and processing.

Additional variations of the Manager-Workers scheme presented here are feasible. It is possible, for example, to have the Manager initially distribute the load evenly to all the workers, and to poll them regularly to check on their progress. Workers that progress slowly can have their load lightened, while those advancing faster are given more work by the Manager. At the end of this chapter we will look at such a variant for the N-queen problem, where the Manager constantly shifts work from the most heavily loaded worker to the least loaded one.

Problem-heap paradigm
Divide and conquer algorithms

The Manager-Workers method can also be used for purposes other than load balancing. In the problem-heap paradigm, for example [MØLL87, GRØN91], the whole data domain can be decomposed into smaller problems, which, if they are simple enough, can be solved directly, or can be themselves decomposed into still smaller problems. Initially the Manager starts with a heap containing the whole problem. Depending on the implementation, the Manager can divide the heap into smaller problems that it starts feeding to the workers, or it can ask one of them to do the original break-down. Once a worker receives a problem, it either solves it and returns the result to the Manager, or it divides it into smaller sub problems. In this later case, it returns all of the sub problems back to the Manager (to the heap), except one that it keeps working on, recursively. This approach is ideally suited for divide-and-conquer algorithms implemented on small size multiprocessor networks. When the size of the network of multiprocessors becomes large, however, the frequent accesses to the heap result in heavy traffic, hot spots, and congestion. The solution in this case is to resort to a distributed system where the heap is divided into sub-heaps kept in the local memory of the multiprocessors. Because processors may reduce their heap at different rates, we are faced with the original problem we set out to investigate: that of balancing the load (heap) among the processors.


EXERCISES


 
10-1

Computing . The code below is a serial program for computing . Transform it into a parallel implementation based on the Manager-Workers paradigm. The Manager will divide the interval [0,1] into N smaller intervals, and will distribute them in the form of triplets of numbers (S,n,) to the workers. Upon receiving a triplet, a workers will compute the integral of

over the interval [S,S+n] by computing
.
Once the sum is computed, the worker will send the result back to the Manager, which will send a new triplet if more are available. Implement a single buffer scheme first, and measure the execution time of your program, along with the utilization of the transputers[3] for various values of N.

Implement a double-buffer version of the program and compare the execution times and utilization graphs of the two implementations. Explain the behavior of the performance curves you measured.

/* ======================================================================= 
        pi.c
        Computation of pi on a serial processor.

        Syntax:  pi low high NoPoints

		where low and high define the interval over which the 
		summation must be performed (normally 0 to 1), and
		NoPoints represents the number of terms in the sum.

       Example:  C:> pi 0 1 100
                 with 100 points, Pi = 3.141600986923122730  

 ====================================================================== */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define f(x) (4.0/(1.0+(x)*(x)))

main(int argc, char *argv[])
{
     double low, high;		/* interval bounds				*/
     long   NoPoints;		/* Number of terms in the sum (or slices)	*/
     double SliceSize;		/* size of a slice: (high-low)/NoPoints	*/
     double x;			/* running variable				*/
     double PartialSum=0;	/* running sum					*/

     /*--- get parameters ---*/
     low       = atof(argv[1]);
     high      = atof(argv[2]);
     NoPoints  = atol(argv[3]);
     SliceSize = (high-low)/NoPoints;

     /*--- add up terms ---*/	
     for (x=low; x<high-(SliceSize*0.5); x += SliceSize)
     	PartialSum += f(x+ SliceSize/2.0)*SliceSize;

     /*--- output result ---*/
     printf("with %ld points, Pi = %20.18lf  ",NoPoints, PartialSum);
}
.

 

 
10-2
Is Conway's game of life an attractive application for a Manager-Workers implementation? Why?


 

 
10-3
Implement a Manager-Workers implementation of the Mandelbrot set. Divide the display area into small square areas defined by the coordinates of their upper left corner, and by the length of their side: (x1,y1,l). These will become the work units distributed by the Manager to its workers. Workers will return an integer array corresponding to the color of the pictures in a square.

Do not display the pixels on the screen as they are generated, but instead make the Manager build the screen internally[4]. Measure the execution time and the processors' utilization (see Exercise 6-2) when single-buffering and double-buffering are used, as a function of the size l of the pixel squares.


 





[Previous] [HOME] [NEXT]