Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



8-2 Processor Utilization

The next measure of performance is the processor utilization. This measure is applied to all the processors active to solve a given application, and reports how much work each one contributes. Assuming that N processors are used by a parallel program, the utilization of Processor Pi is defined as:

where ComputeTime(Pi) represents the amount of time spent by the processor doing actual work, and IdleTime(Pi) represents the amount of time during which the processor was not doing anything. This could represent time during which all tasks are blocked awaiting a data transfer to complete, for example.

Processor utilization is a very intuitive measure, and the only subtlety about it is how the compute and idle times are computed. A quick example will illustrate this point. Assume that we have two processors/transputers involved in a parallel application. They both contribute to the total computation and exhibit periods of activity and idleness as shown in Figure 8-1a. Note that Processor 1 executes longer than Processor 2. Figure 8-1b shows the same information, but as a stacked bar graph where each bar represent the total amount of compute time (darker) and idle time (lighter) spent by each processor towards the completion of the parallel application. The ratio of the height of the darker bar to the height of the stacked-bar for both processors is shown in Figure 8-1c. This third diagram shows the utilization of each processor. It is important to note that the utilization of Processor 2 is computed using the actual time Processor 2 is involved in the computation, and not using the execution time reported by Processor 1, which is also the one observed by the host processor.

Although this distinction may seem arbitrary, it becomes important in a multi-user environment, as Processor 2 may be "harvested" to participate in some other computation, and hence, its utilization should be computed against the actual time it stayed attached to the application being studied.


Figure 8- 1: Two-transputer example. (a) Timing diagram of execution of each transputer, showing computation (useful) time and idle time. (b) Bar graph showing total time as combination of computation time plus idle time. (c) Utilization diagram showing ratio of computation time to total time.

Let's go back to our Mandelbrot example and see how efficiently we utilized the transputers. Table 8-1 below reports on the different execution times recorded for six transputers running the Mandelbrot set example. The first column represents the total execution time reported by each transputer at the end of the program. When the Mandelbrot set is displayed, each transputer sends to the root a packet containing several of the statistics shown here. These statistics do not include the creation of these packets, nor do they include the time required to create the packets. This explains why the Total time column shows the Root as working slightly less than the other transputers. The "1st phase" column represents the time at which a transputer finishes displaying its part of the Mandelbrot set. The "2nd phase" column represents the time it spends afterwards, relaying messages from transputers on its right to the Root down to its left. The "idle 1" column represents the amount of idle time recorded during Phase 1, while the "idle 2" column represents the amount of idle time in the second phase. "Total idle" represents the sum of the previous two numbers. The last three columns represent the processor utilization, broken down into total utilization over the whole application, utilization during the first phase only, and utilization during the second phase.

       Execution times (seconds)                                    Utilizations (%)
Proc   Total   Phase   Phase   Phase   Phase   Total   Total   Phase    Phase 2
 Id    time    1       2       1       2 Idle  idle            1
                               Idle
  1    41.76   23.98   17.77   7.56    13.85   21.42   48.98   68.46    22.06
  2    41.92   27.38   14.54   12.54   12.44   24.98   40.50   54.20    14.44
  3    41.91   32.07   9.84    16.60   8.42    25.01   40.41   48.25    14.46
  4    41.89   36.47   5.43    20.36   4.64    25.00   40.45   44.17    14.45
  5    41.87   41.87   0.00    25.02   0.00    25.02   40.40   40.24    NA
  6    40.93   40.93   0.00    24.27   0.00    24.27   42.18   40.70    NA

Table 8- 1: Execution times and utilization values for Mandelbrot Set calculation on six transputers. NA entries indicate an undefined utilization (Phase 2 = 0 sec).

The graphic representation of the execution times spent by each processor in each phase is truly indicative of the way the computation is organized, as shown in the left-most diagram of Figure 8-2.


Figure 8- 2: Execution times and utlizations associated with the two-phase computation of the Mandelbrot Set on six transputers.

We note that each transputer spends almost exactly the same amount of time running the application (first column contains almost identical numbers). The first and second phases, however, differ widely as a function of the processor Id. The smaller the _node_number, the shorter Phase 1 is, and the longer Phase 2 is. In other words, the Root is the first one to finish displaying its columns of pixels, confirming our comments about Figure 8-Error! Bookmark not defined. that the rightmost wave belonged to the Root. This is a result of the way the ProcAlt statement grants channel requests (see Exercise 8-1), favoring the local channel over the hard one.

It is interesting to see that the amount of idle time during Phase 1 is also proportional to _node_number. Obviously, this is linked to the fact that the Root is the first to finish its part of the Mandelbrot set, since each transputer must compute the exact same number of pixels. In phase 2 the transputers act as relays for their rightmost neighbors, and since they have no real computation to perform, the Idle 2 column reflects the waiting times between channel operations.

Another surprising observation is that the sum of the idle times over the whole computation is roughly the same for all six transputers. The breakdown in two phases clearly sheds more light on the way the transputers cooperate than if we had simply measured the total compute and idle times.

The overall utilization is about the same for all six transputers, and ranges between 40.40% and 42.18%. In other words, each transputer wastes almost 60% of its power. This agrees with the speedup value of 2.36 that we measured earlier. A necessary, but not sufficient, condition to reach a speedup of 6 with six transputers is that all transputers be utilized 100%. If we only harvest 40% of each transputer, then we can only expect to have a speedup as good as 6 x 0.40=2.40, which matches the value of 2.36 which we measured earlier.


EXERCISES


 
8-3

This exercise deals with LS C techniques to measure accurately the execution time and idle time of a transputer program. Measuring the execution time of a program, or section of code on a transputer can be easily accomplished by a call to the Time() function. This function returns the current value of the timer associated with the current priority level. When an application uses mixed-priority tasks, however, calls to Time() may not return compatible values.

How can one devise a function that will always return a consistent time, independently of the priority of the task that called the function? What is the accuracy of the function you just devised?

.


 

 
8-4
Measuring idle time in a transputer is more of a challenge than measuring straight time. The problem is that if the transputer is idle, then it is not doing anything, and if it is not doing anything, how can we make the transputer record how much of "nothing" it is doing?

One solution is to make each transputer create a low priority task as soon as the application starts. This task is low priority and very simply coded:

void IdleTask(Process *P, int *IdleCount)
{
        IdleCount=0;
        while (1)
{ (*IdleCount)++; ProcReschedule(); } }

Write an LS C program that does nothing but launch the IdleTask, lets it run for 1 second, and then display the value of IdleCount. What does the value displayed represent? How can it be used to compute the idle time experienced by a transputer? Give an example including a timing diagram showing how adding the IdleTask to a program can be used to get a fairly accurate measure of the inactivity of the processor.


 

 
8-5

Implement the sorting algorithm presented in Exercise 8-2, and measure the processor utilization obtained. .


 


8-3 Processor Efficiency

The efficiency is defined as the average contribution of the processors towards the global computation. It is defined as the speedup divided by the number of processors:

where Speedup(N) refers to the speedup measured on N processors. The Mandelbrot set example yields the following efficiencies:

  Number   Speedup    Efficienc  Speedup    Efficienc  Speedup    Efficienc
   of      (64-ppp)   y (%)      (240-ppp)  y (%)      (1 ppp)    y (%)
Processor
    s
     1     1.00       100.00       0.98     98.00       0.83      83.00
     2     1.64       82.00       1.38      69.00       1.23      61.50
     3     2.01       67.00       1.59      53.00       1.45      48.33
     4     2.27       56.75       1.62      40.50      1.62       40.50
     5     2.43       48.60      1.62       32.40      1.61       32.20
     6     2.36       39.33       1.63      27.16       1.61      26.83

Table 8-2: Efficiency of processors when Mandelbrot Set is computed on linear chains ranging from 1 to 6 transputers.

We see the efficiency of each processor degrading as the number of transputers increases. Clearly the application as written is not scaling well and each additional transputer lowers the efficiency dramatically. (One obvious reason is the poor choice of a chain network.)

Efficacy

An other measure of performance similar to the efficiency is the efficacy [GHOS91], useful for determining the best cost-effective number of processors for a particular application.

An important property of the Efficacy curve when plotted as a function of the number of processors N, is that its first maximum corresponds to an optimal system operating point [EAGE89]. When a processor can be added to the computation with a resulting increase in efficacy, then the gain of the extra processor out measures the cost of adding it, where the cost is defined as the inverse of the efficiency. When, on the contrary, the efficacy diminishes, then the cost of the addition outweighs the potential performance gain. When applied to our example, we find that for 1 pixel per packet (ppp), the maximum efficacy achievable is 0.761933 for 2 transputers. For 64 ppp, it is 1.355165, also attained for 2 transputers, while for 240 ppp, the maximum of 0.973942 is reached only for 1 transputer.


EXERCISES


 
8-6

Implement the parallel sorting algorithm presented in Exercise 8-2, and compute the processor efficiency obtained with your implementation on a chain of transputers. What is the optimal number of transputers for your implementation?

.


 

 
8-7
The efficiency can also be computed to compare two different parallel machines. For example, if you implement the Mandelbrot Set example included on the distribution disk, you will probably find that the execution times you measure are different from the ones reported in this chapter. Three main reasons may cause this difference. The first one is the rate of the clock signal fed to the transputers in your system. The second likely reason might be a different rate of the signal controlling the communication over the serial links. Finally, the performance of the host system may affect the total execution time, especially for an application such as the Mandelbrot Set where more than 300,000 pixels must be individually displayed. The characteristics of the Host-Transputer system used to generate the data used in this chapter are 20-MHz transputer clock, 20 Mb/s data transfer rate, and a 66 MHz 486-DX PC compatible with a super-vga graphics accelerator card. Referring to this system as System A, and to your system as System B, compute the relative efficiency of A versus B as follows:

where NA and NB refer to the number of processors in System A and System B, respectively. Use NA = 6 for this example.


 


8-4 Packet Throughput

The next measure of performance worth studying is the packet throughput, which is a measure characterizing the effectiveness of the communication structure to route information packets from sources to destinations. It shouldn't be confused with the throughput defined as the number of jobs per second that the parallel machine is capable of completing. This second measure is associated with time-sharing systems. The one we are interested in is defined as the average number of information packets generated per unit time, and can be found expressed as bits/sec, bytes/sec, or packets/sec in cases where the packets are closely related to the computation granularity.

where T is the time during which the packets were generated. An easy way to measure the throughput is to make each transputer maintain a count of the number of bytes transferred over hard channels, and to measure the time duration over which the packets are recorded. For the Mandelbrot set example, the counting is unnecessary, as we know ahead of time how many pixels each transputer is in charge of generating, and how long the program runs for. The picture shown in Figure 8-3 is captured from a VGA screen of 640*480 = 307,200 pixels. Since the transputer exchanges information over packets of K pixels, the total number of information packets generated is 307,200/K. Each transputer sends an extra packet at the end of its execution to indicate to its left neighbor that it and all other transputers on its right are done. Therefore, if N transputers are involved in the computation, we have a total of 307,200/K+N packets generated.

Figure 8-3 shows the throughput in bytes/sec of the three different implementations, at 1, 64 and 240 pixels per packet.

Figure 8- 3: Throughput recorded for Mandelbrot Set example, for 1 to 6 transputers, and for 3 different packet sizes.

Note that since the amount of information exchanged in all three cases is about the same, the throughput bars for the 64 ppp case looks identical to the speedup curve of Figure 8-Error! Bookmark not defined.. This shouldn't be surprising, though, that measures that have different definitions may have similar graphs. We are simply viewing the same experiment from different directions. Each view relates to the same story, but presents a different angle which can be used to further the analysis of how well or bad the parallel application behaves on the network of transputers.

For example, Figure 8-3 shows that 64 ppp offers the higher throughput of the three packets sizes studied, and that the highest throughput is given by a chain of 5 transputers, which we already knew from the speedup curves. We also see that the largest increase in throughput occurs when we switch from 1 to 2 transputers, which is what the efficacy analysis revealed. So throughput, speedup, and efficacy are closely related to one another. We couldn't have predicted much about the processor utilization, however, indicating that throughput and utilization are less directly related.


EXERCISES


 
8-8

The throughput that we have defined is an average computed over the total execution time. If more details are needed about the behavior of the program, we can refine the analysis by plotting a more "instantaneous" variation of the throughput. Doing so requires knowing when the transputers are active, when they are generating packets, and to plot this information as a function of time. Using the information listed in Table 8-1, generate a chart representing the variation of the throughput as a function of time in the chain of 6 transputers. .


 


8-5 Other performance metrics

We have just scratched the surface. Measuring the performance of an application running on a parallel machine can be done using many different metrics, each targeted to a particular aspect of the machine or of the application. When the attention is on the routing of messages, for example, then the mean delay experienced by a packet as it travels the network may be more meaningful than the speedup or processor utilization. As a result, a rich collection of metrics exists, most of them specialized to depict a precise behavior of the application/machine combination.

The performance measures we presented in this chapter provide tools general enough to yield accurate measures of goodness.

benchmarks

It is interesting to note that, just as the different performance measures provides views at different angles on the behavior of a program on a given machine, the conclusions that we may infer from the resulting observations are still very biased in some sense. The reason is that the application that we selected is just one among several hundreds of other problems that could have been chosen, and if one is interested in evaluating the general figures of merit of a multiprocessor system, then we should do so with a collection of applications, as diverse as possible. Such collections are called benchmark suites. A benchmark is a particular application or program that has been chosen for its qualities as either representing a type of decomposition or computation common to many other programs, or known to stress a particular aspect of the multiprocessor under study. Benchmarks and suites are just as numerous as the performance metrics are, and are specialized to expose particular architecture characteristics. A good survey of benchmark suites can be found in [LEWI92].




[Previous] [HOME] [NEXT]