where ComputeTime(Pi) represents the amount of time spent by the processor doing actual work, and IdleTime(Pi) represents the amount of time during which the processor was not doing anything. This could represent time during which all tasks are blocked awaiting a data transfer to complete, for example.
Processor utilization is a very intuitive measure, and the only subtlety about it is how the compute and idle times are computed. A quick example will illustrate this point. Assume that we have two processors/transputers involved in a parallel application. They both contribute to the total computation and exhibit periods of activity and idleness as shown in Figure 8-1a. Note that Processor 1 executes longer than Processor 2. Figure 8-1b shows the same information, but as a stacked bar graph where each bar represent the total amount of compute time (darker) and idle time (lighter) spent by each processor towards the completion of the parallel application. The ratio of the height of the darker bar to the height of the stacked-bar for both processors is shown in Figure 8-1c. This third diagram shows the utilization of each processor. It is important to note that the utilization of Processor 2 is computed using the actual time Processor 2 is involved in the computation, and not using the execution time reported by Processor 1, which is also the one observed by the host processor.
Although this distinction may seem arbitrary, it becomes important in a multi-user environment, as Processor 2 may be "harvested" to participate in some other computation, and hence, its utilization should be computed against the actual time it stayed attached to the application being studied.
Let's go back to our Mandelbrot example and see how efficiently we utilized the transputers. Table 8-1 below reports on the different execution times recorded for six transputers running the Mandelbrot set example. The first column represents the total execution time reported by each transputer at the end of the program. When the Mandelbrot set is displayed, each transputer sends to the root a packet containing several of the statistics shown here. These statistics do not include the creation of these packets, nor do they include the time required to create the packets. This explains why the Total time column shows the Root as working slightly less than the other transputers. The "1st phase" column represents the time at which a transputer finishes displaying its part of the Mandelbrot set. The "2nd phase" column represents the time it spends afterwards, relaying messages from transputers on its right to the Root down to its left. The "idle 1" column represents the amount of idle time recorded during Phase 1, while the "idle 2" column represents the amount of idle time in the second phase. "Total idle" represents the sum of the previous two numbers. The last three columns represent the processor utilization, broken down into total utilization over the whole application, utilization during the first phase only, and utilization during the second phase.
Execution times (seconds) Utilizations (%)
Proc Total Phase Phase Phase Phase Total Total Phase Phase 2
Id time 1 2 1 2 Idle idle 1
Idle
1 41.76 23.98 17.77 7.56 13.85 21.42 48.98 68.46 22.06
2 41.92 27.38 14.54 12.54 12.44 24.98 40.50 54.20 14.44
3 41.91 32.07 9.84 16.60 8.42 25.01 40.41 48.25 14.46
4 41.89 36.47 5.43 20.36 4.64 25.00 40.45 44.17 14.45
5 41.87 41.87 0.00 25.02 0.00 25.02 40.40 40.24 NA
6 40.93 40.93 0.00 24.27 0.00 24.27 42.18 40.70 NA
Table
8-
1:
Execution times and utilization values for Mandelbrot Set calculation on six
transputers. NA entries indicate an undefined utilization (Phase 2 = 0
sec).The graphic representation of the execution times spent by each processor in each phase is truly indicative of the way the computation is organized, as shown in the left-most diagram of Figure 8-2.
We note that each transputer spends almost exactly the same amount of time running the application (first column contains almost identical numbers). The first and second phases, however, differ widely as a function of the processor Id. The smaller the _node_number, the shorter Phase 1 is, and the longer Phase 2 is. In other words, the Root is the first one to finish displaying its columns of pixels, confirming our comments about Figure 8-Error! Bookmark not defined. that the rightmost wave belonged to the Root. This is a result of the way the ProcAlt statement grants channel requests (see Exercise 8-1), favoring the local channel over the hard one.
It is interesting to see that the amount of idle time during Phase 1 is also proportional to _node_number. Obviously, this is linked to the fact that the Root is the first to finish its part of the Mandelbrot set, since each transputer must compute the exact same number of pixels. In phase 2 the transputers act as relays for their rightmost neighbors, and since they have no real computation to perform, the Idle 2 column reflects the waiting times between channel operations.
Another surprising observation is that the sum of the idle times over the whole computation is roughly the same for all six transputers. The breakdown in two phases clearly sheds more light on the way the transputers cooperate than if we had simply measured the total compute and idle times.
The overall utilization is about the same for all six transputers, and ranges between 40.40% and 42.18%. In other words, each transputer wastes almost 60% of its power. This agrees with the speedup value of 2.36 that we measured earlier. A necessary, but not sufficient, condition to reach a speedup of 6 with six transputers is that all transputers be utilized 100%. If we only harvest 40% of each transputer, then we can only expect to have a speedup as good as 6 x 0.40=2.40, which matches the value of 2.36 which we measured earlier.
| 8-3 |
This exercise deals with LS C techniques to measure accurately the execution time and idle time of a transputer program. Measuring the execution time of a program, or section of code on a transputer can be easily accomplished by a call to the Time() function. This function returns the current value of the timer associated with the current priority level. When an application uses mixed-priority tasks, however, calls to Time() may not return compatible values. How can one devise a function that will always return a consistent time, independently of the priority of the task that called the function? What is the accuracy of the function you just devised? . |
| 8-4 |
Measuring idle time in a transputer is more of a challenge than
measuring straight time. The problem is that if the transputer is idle, then
it is not doing anything, and if it is not doing anything, how can we make the
transputer record how much of "nothing" it is doing? One solution is to make each transputer create a low priority task as soon as the application starts. This task is low priority and very simply coded:
void IdleTask(Process *P, int *IdleCount)
{
IdleCount=0;
while (1)
Write an LS C program that does nothing but launch the IdleTask, lets it run for 1 second, and then display the value of IdleCount. What does the value displayed represent? How can it be used to compute the idle time experienced by a transputer? Give an example including a timing diagram showing how adding the IdleTask to a program can be used to get a fairly accurate measure of the inactivity of the processor. |
| 8-5 |
Implement the sorting algorithm presented in Exercise 8-2, and measure the processor utilization obtained. . |
where Speedup(N) refers to the speedup measured on N processors. The Mandelbrot set example yields the following efficiencies:
Number Speedup Efficienc Speedup Efficienc Speedup Efficienc
of (64-ppp) y (%) (240-ppp) y (%) (1 ppp) y (%)
Processor
s
1 1.00 100.00 0.98 98.00 0.83 83.00
2 1.64 82.00 1.38 69.00 1.23 61.50
3 2.01 67.00 1.59 53.00 1.45 48.33
4 2.27 56.75 1.62 40.50 1.62 40.50
5 2.43 48.60 1.62 32.40 1.61 32.20
6 2.36 39.33 1.63 27.16 1.61 26.83
Table
8-2: Efficiency of processors when Mandelbrot Set is computed on linear chains
ranging from 1 to 6 transputers.We see the efficiency of each processor degrading as the number of transputers increases. Clearly the application as written is not scaling well and each additional transputer lowers the efficiency dramatically. (One obvious reason is the poor choice of a chain network.)
An important property of the Efficacy curve when plotted as a function of the number of processors N, is that its first maximum corresponds to an optimal system operating point [EAGE89]. When a processor can be added to the computation with a resulting increase in efficacy, then the gain of the extra processor out measures the cost of adding it, where the cost is defined as the inverse of the efficiency. When, on the contrary, the efficacy diminishes, then the cost of the addition outweighs the potential performance gain. When applied to our example, we find that for 1 pixel per packet (ppp), the maximum efficacy achievable is 0.761933 for 2 transputers. For 64 ppp, it is 1.355165, also attained for 2 transputers, while for 240 ppp, the maximum of 0.973942 is reached only for 1 transputer.
| 8-6 |
Implement the parallel sorting algorithm presented in Exercise 8-2, and compute the processor efficiency obtained with your implementation on a chain of transputers. What is the optimal number of transputers for your implementation? . |
| 8-7 |
The efficiency can also be computed to compare two different parallel
machines. For example, if you implement the Mandelbrot Set example included on
the distribution disk, you will probably find that the execution times you
measure are different from the ones reported in this chapter. Three main
reasons may cause this difference. The first one is the rate of the clock
signal fed to the transputers in your system. The second likely reason might
be a different rate of the signal controlling the communication over the serial
links. Finally, the performance of the host system may affect the total
execution time, especially for an application such as the Mandelbrot Set where
more than 300,000 pixels must be individually displayed. The characteristics
of the Host-Transputer system used to generate the data used in this chapter
are 20-MHz transputer clock, 20 Mb/s data transfer rate, and a 66 MHz 486-DX PC
compatible with a super-vga graphics accelerator card. Referring to this
system as System A, and to your system as System B, compute the relative
efficiency of A versus B as follows:
|
where T is the time during which the packets were generated. An easy way to measure the throughput is to make each transputer maintain a count of the number of bytes transferred over hard channels, and to measure the time duration over which the packets are recorded. For the Mandelbrot set example, the counting is unnecessary, as we know ahead of time how many pixels each transputer is in charge of generating, and how long the program runs for. The picture shown in Figure 8-3 is captured from a VGA screen of 640*480 = 307,200 pixels. Since the transputer exchanges information over packets of K pixels, the total number of information packets generated is 307,200/K. Each transputer sends an extra packet at the end of its execution to indicate to its left neighbor that it and all other transputers on its right are done. Therefore, if N transputers are involved in the computation, we have a total of 307,200/K+N packets generated.
Figure 8-3 shows the throughput in bytes/sec of the three different implementations, at 1, 64 and 240 pixels per packet.
Note that since the amount of information exchanged in all three cases is about the same, the throughput bars for the 64 ppp case looks identical to the speedup curve of Figure 8-Error! Bookmark not defined.. This shouldn't be surprising, though, that measures that have different definitions may have similar graphs. We are simply viewing the same experiment from different directions. Each view relates to the same story, but presents a different angle which can be used to further the analysis of how well or bad the parallel application behaves on the network of transputers.
For example, Figure 8-3 shows that 64 ppp offers the higher throughput of the three packets sizes studied, and that the highest throughput is given by a chain of 5 transputers, which we already knew from the speedup curves. We also see that the largest increase in throughput occurs when we switch from 1 to 2 transputers, which is what the efficacy analysis revealed. So throughput, speedup, and efficacy are closely related to one another. We couldn't have predicted much about the processor utilization, however, indicating that throughput and utilization are less directly related.
| 8-8 |
The throughput that we have defined is an average computed over the total execution time. If more details are needed about the behavior of the program, we can refine the analysis by plotting a more "instantaneous" variation of the throughput. Doing so requires knowing when the transputers are active, when they are generating packets, and to plot this information as a function of time. Using the information listed in Table 8-1, generate a chart representing the variation of the throughput as a function of time in the chain of 6 transputers. . |
The performance measures we presented in this chapter provide tools general enough to yield accurate measures of goodness.