Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



Tuning


Granularity

Tuning a parallel program starts when it is written, debugged, and when the programmer's effort can be shifted from the development to the performance. There are many levels at which this tuning can take place. The granularity of the data blocks exchanged between processors is an important element that can influence the performance widely. When the data blocks are large, there is less time spent in the overhead of data communication and more time spent on the computation. However, the large data blocks often require computation times proportionally larger, and lower levels of parallelism might be achieved. On the other hand, when the data blocks are small, latencies can be more easily absorbed as short communication bursts can be overlapped with computation. The optimal size of the data block requires a careful modeling of the parallel application on the multiprocessor system, or tuning.

Interval between updates

For our purpose, tuning can be applied to finding the best interval of time between load-balancing periods. We saw in the previous section that the effectiveness, and hence the performance of the load-balanced application was dependent on the heuristic used to balance the load, and on the update interval. Because the Robin-Hood heuristic we chose here always brings back the two nodes that support the extreme loads, it must be run often to make sure the gap between these nodes (possibly different every update) does not increase. Tuning can take the form of a simple experiment where the parameter of interest is defined at run time. Figure 10-6 shows the result of measuring the execution time of the N-queen problem for N=27 and various interval lengths.


Figure 10- 6: Tuning the
N-queen problem for N=27, and intervals ranging from 500 sec. to 50 sec. Each execution time is the average of two runs.

We see that a 5 ms interval length provides the best execution time, 250.94 seconds, compared to 295.89 seconds for the asymptotic bounds for longer delays.


EXERCISES


 
10-4

The tuning presented above shows how we optimized the program for one value of N: N=27. We didn't optimize the program for a range of values of N. Tune the program for all the values of N between 12 and 28, and tabulate the results. (Run the program several times for each (N,delay) pair, and record the average execution time.)

                           N          Best update delay
                          12          ?
                          ...
                          28          ?

Can you find a relationship between N and the best delays? If so, express it as a function of N that can be implemented in C, and implement it in the N-queen program so that the program selects the best delay as a function of N automatically (if you can't find a mathematically representable relationship, use a look-up table). Verify that your program yields good performance by running it several times for each value of N in the range 12 to 28.

.


 

 
10-5
The right-hand side of the curves in Figure 10-2 seem to indicate that it is faster to find the first solution for a board of size N=2n+1 than for a board of size N=2n. If this is the case, and if the solution happens to have the first queen in the leftmost column of the top row, then we have the unexpected result that a solution for a problem of size 2n can be found faster if we first consider the problem of size 2n+1. The requirement that the first queen be located in the leftmost column allows us to simply throw it away, and so discarding the first row and first column. The remaining board area has size N by N, and contains one queen on each row and each column, forming a solution of size N. This property was noticed by Stone and Stone [STON86] in their work on the N-queen problem, when the next position of the queen is not restricted to be in the next row (see Exercise 9-3), and explored in details by Rao and Kumar [RAO93].

Modify the program so that it first tries to find a solution where the first queen is in the top left corner of a board of size N by N, when N=2n+1, and investigate if such is the case for all reasonable values of N (i.e. values that will not require an unreasonable amount of time on your transputer system). Can solutions for N=2n be found just as fast when the first queen is forced to be in the upper left-corner as in the general case?

Tune your program so that it automatically detects the parity of N and selects the fastest method to find a solution for that board size. Test your program over a reasonable range of values of N and show the improvement brought by your tuning.


 

 
10-6
Implement the load-balancing scheme that was introduced in Figure 10-Error! Bookmark not defined.. This second method of load-balancing is not centralized and requires near-neighbor balancing. First connect all the nodes by a virtual ring and implement a load-balancing where nodes exchange information with their left and right neighbors. Measure the performance of your program for N ranging from 12 to 27, for an average balancing delay of 1 ms. Tune your program in terms of its load-balancing delay and plot the resulting best execution times as a function of N.

Now change the virtual network from a ring to a toroidal mesh (or a to some other two-dimensional networks in case you do not have a sufficient number of transputers, and repeat the same steps.

Compare the results obtained with a ring and with a toroidal mesh.


 

 
10-7
Implement the heuristic introduced in Figure 10-4 and compare it, in terms of performance, to the Robin-Hood scheme.

 


10-3 Concluding Remarks

Load balancing closes our two-chapter exploration of problem decomposition. Load balancing is a general technique, and our presenting it in a specific context should not prevent you from using it in other situations where it can be adapted to maintain a uniform processor utilization. This is indeed the ultimate goal of load balancing and of methods derived from it: keeping the processor busy computing intermediate results contributing to the general solution. The performance of the system will come from a well-balanced technique where processors never starve and spend very little overhead time in the balancing itself.

Its used should also incorporate some of the techniques we saw in previous chapters to maintain high efficiency, such as double buffering, allowing processors to continue doing useful work when packets of work are shifted through the network.




[Previous] [HOME] [NEXT]