Measuring the performance of a parallel program is a way to assess how well and how efficient our efforts have been at dividing the application into modules cooperating with each other in parallel.
The most visible and easily recorded metric of performance is the execution time. By measuring how long the parallel program needs to run to solve our problem, we can directly measure its effectiveness. To find out how much better our program does on the parallel machine, compared to a program running on only one processor, taking the ratio of the two is a natural solution. This measure is called the speedup, and is associated with the number of processors in the machine. We will refer to a program has having a speedup of x on N processors, for example.
We will also investigate other performance metrics. Each one will provide us with a different dimension, or different view point to observe the "performance" of the system. The metrics are all related to each other, but highlight different components of the dynamics at work during the execution of the parallel program.
Our first measure of performance relates to execution speed, and is a measure of how much faster the program runs on the parallel machine than it does on a serial machine. Let's define it formally:
Speedup = (Serial Execution Time)/(Parallel Execution Time)
Speedup = T(1)/T(N)
where T(N) represents the execution time taken by the program running on N processors, and T(1) represents the time taken by the best serial implementation of the application measured on one processor. In other word, T(1) is not only the execution time on one transputer, but it is the best time recorded. Although this definition is fairly straightforward, it hides some subtleties in the way T(1) is defined. We will come back to this point shortly.
The network is setup as a chain, where the Root receives all the packets of triplets and displays each one via a putpixel(x,y,color) statements similar to Turbo C's graphic command[2]. Although this algorithm is not the most efficient one we could have coded, it is illustrative of many important aspects of performance measures.

Figure 8-1:
Mandelbrot set computed by 6 Transputers. Different "waves" indicate progress
of individual processors.
Figure 8-1 is a screen capture of the Mandelbrot set during the program execution, on a screen originally all black. The waves (marked by tick marks on the bottom of the figure) show the progress of individual transputers, with the right-most wave, that is the wave that is closer to finishing belonging to the Root transputer, and the left-most wave, on the left of which the figure is complete, represents the progress of Node 5 and Node 6. The last two transputers produce columns at approximately the same rate, so distinction between the two is not possible on the figure. Timing the program on a chain of 1, 2, 3, 4, 5 and 6 transputers, respectively, for packets of P=64 pixels, yields the following data.
Figure
8-
2:
Speedup curve for Mandelbrot example.
Applying the formula for the speedup, and using 99.230769 seconds as T(1) (see Table 8-1), we get the speedup curve shown in Figure 8-2.
Number of Execution Time (seconds)
Transputers
1 99.230769
2 60.274725
3 49.285714
4 43.681319
5 40.714286
6 41.978022
Table
8-
1:
Total execution times for displaying the Mandelbrot set using 1 to 6
transputers.Our application is far from exhibiting a linear speedup. In fact, it shows a downward trends at 6 transputers, indicating that its speedup may even decrease as we bring in more processors. This isn't what we had expected! Could it be that the size of the packet (64 pixels) is too low, or too high, and that a different packet size would have shown a steeper speedup curve?
This is a good question, and requires careful investigation in the area of load balancing, which is the subject of the Chapter 10. But we can still run a couple of other experiments and see if we can improve on this first example.
Running the same experiment for a packet size of 1 and 240 yields the results shown in Table 8-2.
Number of Execution Time Execution Time
Transputers (seconds) Packet size = (seconds) Packet size =
240 pixel 1 pixels
1 100.5495 119.5055
2 71.42857 80.38462
3 62.30769 68.07692
4 61.20879 61.20879
5 61.15385 61.59341
6 60.6044 61.59341
Table
8-
2:
Variation of execution times for different packet sizes. A packet contains
triplets representing each pixel: (x,y,color).
But why is this necessary? The reason is that this makes a more honest measure of speedup. An example will illustrate this point. Assume that you and a friend have a contest, and decide to test your programming and algorithmic skills by writing a parallel program solving a given problem, and that the program with the highest measure of speedup wins. At the due date, you meet your friend and compare the speedups for 10 processors. Your program has a speedup of 5.5, while your friend boasts a much nicer figure of 6.5. You loose... Well, it is time to congratulate your friend and exchange some information about the different approaches you both took. Discussing the details of the experiment, you realize both you and your friend computed the speedup as T(1)/T(10), and used your own programs to generate T(1). But you also find out that the execution times for your friend's program were 65 seconds and 10 seconds for T(1) and T(10), respectively, yielding the boasted speedup of 6.5. Your values, however, were 50 and 9.09 seconds. The 6.5 figure is definitely not a fair speedup measure then, since you were able to write a program to solve the given problem on one transputer in 50 seconds, 15 seconds faster than your opponent. The problem is that you both used your own program to generate T(1) without checking if a faster serial version existed. Your friend should therefore change his/her value of T(1) and use 50, which will now yield a more reasonable speedup of 50/10= 5!
This seemingly strict restriction on T(1) is important. In the past, claims of high speedups sometimes didn't hold true because the authors erroneously did not use the most efficient serial algorithm in their computation.
Going back to our Mandelbrot program, we see that the new experiments didn't improve our performance at all. The best value recorded is a speedup of 2.43 recorded on 5 transputers. Why is it so low, and why does it seem to be bound by above?
Number of Speedup (64-ppp4) Speedup (240-ppp) Speedup (1 ppp)
Transputers
1 1.00 0.98 0.83
2 1.64 1.38 1.23
3 2.01 1.59 1.45
4 2.27 1.62 1.62
5 2.43 1.62 1.61
6 2.36 1.63 1.61
Table
8-3: Influence of packet size on speedup. Note that the speedups for 1 and 240
ppp are not equal to 1 because the execution time for 64 ppp is faster, and
should therefore be used as reference.One of the answers lies in the network we chose. The more transputers we use in the chain, the harder it is for the Root to process all the packets and display them. The Root, because it must display all the points one at a time, in a serial fashion, becomes a bottleneck. But this also illustrates one important law of parallel programming: Amdahl's law.