Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



8
MEASURING PERFORMANCE

8-1 Introduction

Performance is of paramount importance in parallel programming. The reason we are in the game of writing parallel programs is either to solve a problem faster than on a serial computer, or to solve a larger problem than could previously be done.

Measuring the performance of a parallel program is a way to assess how well and how efficient our efforts have been at dividing the application into modules cooperating with each other in parallel.

The most visible and easily recorded metric of performance is the execution time. By measuring how long the parallel program needs to run to solve our problem, we can directly measure its effectiveness. To find out how much better our program does on the parallel machine, compared to a program running on only one processor, taking the ratio of the two is a natural solution. This measure is called the speedup, and is associated with the number of processors in the machine. We will refer to a program has having a speedup of x on N processors, for example.

We will also investigate other performance metrics. Each one will provide us with a different dimension, or different view point to observe the "performance" of the system. The metrics are all related to each other, but highlight different components of the dynamics at work during the execution of the parallel program.

8-1 Execution Time and Speedup

The whole reason why we have spent all this time learning the intricacies of Logical Systems C is that we hope to be able to harness the parallelism offered by a group of transputers, and to write code that will run in less time than on a serial machine, such as a Personal Computer, a Macintosh, or a workstation (which all support interfaces to transputer systems).

Our first measure of performance relates to execution speed, and is a measure of how much faster the program runs on the parallel machine than it does on a serial machine. Let's define it formally:

Speedup = (Serial Execution Time)/(Parallel Execution Time)

Speedup = T(1)/T(N)

where T(N) represents the execution time taken by the program running on N processors, and T(1) represents the time taken by the best serial implementation of the application measured on one processor. In other word, T(1) is not only the execution time on one transputer, but it is the best time recorded. Although this definition is fairly straightforward, it hides some subtleties in the way T(1) is defined. We will come back to this point shortly.

An example

To illustrates the different aspects of performance measures, we will choose and example that is not only interesting but also fun to watch. It is the celebrated Mandelbrot set, which we have programmed to run on a chain of transputers[1]. The algorithm we have chosen works as follows: the video screen is logically divided into columns of pixels. A Personal Computer VGA screen, for example, will contain 640 columns of 480 pixels. Each transputer is assigned full columns of pixels. In a system of N transputers, the Root will be in charge of computing the pixels making up columns 0, N, 2N, 3N, etc., while Transputer 2 will be responsible for columns 1, N+1, 2N+1, 3N+1, and so on. Each transputer sends out the pixels as packets of P triplets (x,y,color), where P divides the column height evenly, and x and y represent the screen coordinates of the pixel.

The network is setup as a chain, where the Root receives all the packets of triplets and displays each one via a putpixel(x,y,color) statements similar to Turbo C's graphic command[2]. Although this algorithm is not the most efficient one we could have coded, it is illustrative of many important aspects of performance measures.


Figure 8-1: Mandelbrot set computed by 6 Transputers. Different "waves" indicate progress of individual processors.

Figure 8-1 is a screen capture of the Mandelbrot set during the program execution, on a screen originally all black. The waves (marked by tick marks on the bottom of the figure) show the progress of individual transputers, with the right-most wave, that is the wave that is closer to finishing belonging to the Root transputer, and the left-most wave, on the left of which the figure is complete, represents the progress of Node 5 and Node 6. The last two transputers produce columns at approximately the same rate, so distinction between the two is not possible on the figure. Timing the program on a chain of 1, 2, 3, 4, 5 and 6 transputers, respectively, for packets of P=64 pixels, yields the following data.


Figure 8- 2: Speedup curve for Mandelbrot example.

Applying the formula for the speedup, and using 99.230769 seconds as T(1) (see Table 8-1), we get the speedup curve shown in Figure 8-2.

linear speedup

The figure shows two curves, one straight, climbing at 45 degrees, and the other one, our speed-up curve, growing much slower and plateau-ing at 5 transputers. The first curve represents the upper bound for the speed-up. If an application takes 10 seconds on one processor, then the best we can expect for a speedup on two processors would be half as much time, or 5 seconds, yielding a speedup of 10/5=2. By the same token, if we run the same application on N processors, then we can expect at best a reduction in time of N, and therefore that the speedup will be at most N. Hence, the best we can ever expect to achieve when running on N processors is a speedup of N. We will refer to programs or applications that reach this (hard to achieve) feat as applications with linear speedup[3].

              Number of              Execution Time (seconds)         
             Transputers                                              
                  1                          99.230769                
                  2                          60.274725                
                  3                          49.285714                
                  4                          43.681319                
                  5                          40.714286                
                  6                          41.978022                

Table 8- 1: Total execution times for displaying the Mandelbrot set using 1 to 6 transputers.

Our application is far from exhibiting a linear speedup. In fact, it shows a downward trends at 6 transputers, indicating that its speedup may even decrease as we bring in more processors. This isn't what we had expected! Could it be that the size of the packet (64 pixels) is too low, or too high, and that a different packet size would have shown a steeper speedup curve?

This is a good question, and requires careful investigation in the area of load balancing, which is the subject of the Chapter 10. But we can still run a couple of other experiments and see if we can improve on this first example.

Running the same experiment for a packet size of 1 and 240 yields the results shown in Table 8-2.

        Number of          Execution Time       Execution Time            
       Transputers    (seconds) Packet size =   (seconds) Packet size =   
                             240 pixel          1 pixels                  
            1                 100.5495          119.5055                  
            2                 71.42857          80.38462                  
            3                 62.30769          68.07692                  
            4                 61.20879          61.20879                 
            5                 61.15385          61.59341                  
            6                 60.6044           61.59341                  

Table 8- 2: Variation of execution times for different packet sizes. A packet contains triplets representing each pixel: (x,y,color).

Is the speedup 1?

Let's compute the speedup, starting with the first column (240-pixel packets). What is the speedup for 1 transputer in this case? The answer is negative, because the definition calls for the best time recorded for that particular program running on one transputer. Out of the three different packet sizes we studied, the execution time for a packet of size 64 is the best, with 99.230769 seconds, so we should use this value, and not 100.5495. In fact, if we were very serious about the claims we are making about our speedup, we should look for all Mandelbrot set programs written for the transputer, and time each one on the Root of our system, and keep the best value we find as T(1).

Serial time

So the program with the best serial time should be the contender against which we should measure our parallel implementation. It is important, of course, that the program with the best serial time be measured on the same hardware we are using to run the parallel implementation. We should keep as many of the remaining parameters identical to both experiments, including processor clock speed, amount of memory available, link speed, host processor, host video hardware, and even the same compiler.

But why is this necessary? The reason is that this makes a more honest measure of speedup. An example will illustrate this point. Assume that you and a friend have a contest, and decide to test your programming and algorithmic skills by writing a parallel program solving a given problem, and that the program with the highest measure of speedup wins. At the due date, you meet your friend and compare the speedups for 10 processors. Your program has a speedup of 5.5, while your friend boasts a much nicer figure of 6.5. You loose... Well, it is time to congratulate your friend and exchange some information about the different approaches you both took. Discussing the details of the experiment, you realize both you and your friend computed the speedup as T(1)/T(10), and used your own programs to generate T(1). But you also find out that the execution times for your friend's program were 65 seconds and 10 seconds for T(1) and T(10), respectively, yielding the boasted speedup of 6.5. Your values, however, were 50 and 9.09 seconds. The 6.5 figure is definitely not a fair speedup measure then, since you were able to write a program to solve the given problem on one transputer in 50 seconds, 15 seconds faster than your opponent. The problem is that you both used your own program to generate T(1) without checking if a faster serial version existed. Your friend should therefore change his/her value of T(1) and use 50, which will now yield a more reasonable speedup of 50/10= 5!

This seemingly strict restriction on T(1) is important. In the past, claims of high speedups sometimes didn't hold true because the authors erroneously did not use the most efficient serial algorithm in their computation.

Going back to our Mandelbrot program, we see that the new experiments didn't improve our performance at all. The best value recorded is a speedup of 2.43 recorded on 5 transputers. Why is it so low, and why does it seem to be bound by above?

      Number of    Speedup (64-ppp4)  Speedup (240-ppp)  Speedup (1 ppp)    
     Transputers                                                            
            1            1.00           0.98              0.83              
            2            1.64          1.38               1.23              
            3            2.01          1.59               1.45              
            4            2.27          1.62               1.62               
            5            2.43          1.62               1.61               
            6            2.36          1.63               1.61              

Table 8-3: Influence of packet size on speedup. Note that the speedups for 1 and 240 ppp are not equal to 1 because the execution time for 64 ppp is faster, and should therefore be used as reference.

One of the answers lies in the network we chose. The more transputers we use in the chain, the harder it is for the Root to process all the packets and display them. The Root, because it must display all the points one at a time, in a serial fashion, becomes a bottleneck. But this also illustrates one important law of parallel programming: Amdahl's law.



Figure 8- 3: Example of the Amdahl's law at work




[Previous] [HOME] [NEXT]