Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



Amdahl's Law

Amdahl [AMDA67] cast a pessimistic cloud on the practice of and the research in parallel computation when, in 1967, he pointed out the problem created by the serial component of all parallel programs, and showing that such component, even very small, automatically puts a cap on the possible speedup, as we have defined it. Amdahl put an end to the belief that the more processors are thrown at a problem, the faster it gets solved. The idea is simple in its elegance. Suppose that we have an application that we want to move from a serial machine to a parallel one, so that we can run it faster. In fact we want to run it much faster. After a careful analysis of its requirements and its code, we find that a great bulk of the code can be run in parallel. In fact 95% of the code can be run in parallel, and to make things simpler, we will assume that we won't incur any overhead in writing the parallel version of the application to the parallelization. The remaining 5% is simply code that must be performed serially. This could be the display of pixels on the screen by the root, for example. Only 1 transputer can execute this code, and it cannot be made parallel. We will refer to it as serial code. We still have the other 95% that can be made parallel. We can just port the program to a parallel computer with, say a thousands processors and we will get very good performance... Or will we? Let's take a look at this proposition Amdahl's way.

Figure 8-3 shows the execution times of this fictitious application when we run it on 1, 5, 10, 50, 100, 500, 1000, and 10,000 processors, along with the corresponding speedups. The vertical bars represent the execution time of the parallel code (light color) and of the serial code (darker color). For one transputer, their sizes are 95 and 5, representing 95% and 5% of T(1). The associated speedup is 1. When 5 processors run this application, the 95%-component is divided by 5, since it can be made parallel without any overhead[1], and shows up as a bar 19 units high. On top of it sits the 5-unit serial code, unchanged. The speedup is now 100/(19+5)=4.17, and not 5, as we could have expected earlier on. You will now have caught on the story. As we progress to the right, the height of the combined bars diminish, but the 5-unit component remains, and at 1000 and 10000 processors, the 95-unit bar is so spread-out that we can barely see its execution time on that many processors. The 5-unit bar however, is the one that is limiting the speedup. In fact the speed can only converge up to 20, without ever passing this bound.

If we call Rs the ratio of serial code in the original application, in our example 5%, then we can express T(1) and T(N) as:

As a result, Amdahl's law takes on the following simple form:

Following Amdahl's argument, we can no longer expect that any speedup value can be attained by simply using the required number of processors. If an application contains 1% of code inherently serial, then no matter how many processors are working on the application. A speedup of 1/1%=100 is the best we can expect.

Ways around Amdhahl's Law


Minsky's conjecture

Amdahl's law puts a damper on any enthusiasm we may have had about reaching high performance on our parallel transputer system. On any system, in fact, and for a while the accepted norm was that generally the speedup of an application would fall between two bounds: (ln2N) and (Nln2N), where N is the number of processors. The first one is referred to as Minsky's conjecture, and the other one has been considered a general upper bound for speedup curves [HWAN84]. But the Amdahl Law, although correct in its statement, can be circumvented by a very simple observation: It is stated assuming that N and Rs are independent. They are not.

Assume that the parallel application that we want to measure the speedup of, has serial and parallel components of different complexity. The complexity we are referring to here is the algorithmic measure of the variation of the execution time of the program as a function of the size of the data and the number of processors. The serial component may have a complexity of, say, O(N2), where N is the size of the problem. This serial component will very likely take the same amount of time when run on one processor or on several of them[2]. For the parallel component, however, this will be a different story. Assume that on one processor processing N pieces of data take time proportional to O(N3) , but that when the same N pieces of data are processed on an N-processor network, the complexity is O(N2) which is feasible in practice. What happens to the speedup? What we get is the ratio of terms that grow at different rates:

Our original view of a serial program as made up of a serial component and of a parallel component divisible into as many chunks as we have processors is overly simplistic. Often the complexity of the parallel component when run on a parallel machine will be different, and lower, than that of the serial algorithm, and by increasing the size of the data along with the number of processors, the serial component can be made negligible in the speedup equation, yielding far greater speedups than predicted by the Amdahl Law.

This is refreshing news. What it is telling us is that if there exists huge applications with enormous data sets, and efficient algorithms to process them, then it is worth building larger computers that can treat such large problems, because as we increase both the size of the data and the size of the machine, we can expect the speedup to increase. Weather prediction and seismologic simulation are such applications.

Gustafson and Barsis's Law

Gustafson and Barsis exploited this relationship between N and Rs when in 1988 they won the Gordon Bell Prize of $1000.00 offered to successful applications of parallel computing to the real-world problems [LEWI92]. Their approach was to circumvent Amdahl's law by scaling up the problem to fit the machine. Here is a simplified explanation of how it was done.

Assume that we have available some application that can be made parallel, and that when it is run on one processor, it takes some time T(1) = Ts + Tp.. For simplicity, assume furthermore that Tp, the parallel component, can be "divided up" at will into any number N of individual equal chunks of duration Tu, such that Tu represents the time it would take each processor of an N-processor machine to carry out the execution of the parallel part. Now let's choose N such that Ts+Tu is equal to 1 unit of time. This means that if we are careful coding your application, the parallel time T(N) will be equal to Ts+Tu, or 1. Hence the speedup of our application, when run on an N-processor machine is:

Gustafson-Barsis Law

This equation doesn't look that much interesting until we notice that since T(N) is equal to 1, then Ts is the proportion of serial code, in the application, and represents our original coefficient Rs. Since T(N) is equal to 1, we can substitute (1-Ts) for Tu, and get the following formula, known as the Gustafson-Barsis law:

When they won the prize, Gustafson and Barsis had reach a speedup up close to 1000 on a 1024 processor machine.

Before we move on to other performance measure, you should keep in mind that our description so far has avoided the issue of overhead associated with making an application parallel. For transputer systems, the cost of communication will be an important factor that will affect the speedup negatively. However, good control of the number, size, and timing of the data packets exchanged during the computation can bring this overhead down to reasonable amounts.


EXERCISES


 
8-1

You must be familiar enough with transputer programs by now to comment on the rate of completion of the transputers with the Mandelbrot set example. The network is a chain where all transputers compute the same number of columns of pixels. Each transputer, except the last one in the chain, also acts as a relay for packets generated by one side of neighbors for the Root located on the other side.

Each node runs two tasks, one computing (low priority) and one relaying (high priority). The relay task receives packets from the Compute task over a soft channel, and from neighbors on the "right" from a hard link. It then transfer them to the "left" over a hard link, except for the Root which sends them to the Host embedded in a "putpixel" graphic command.

The code executed by the relay task is of the following form:

int Toggle=1;

if (Toggle)
	index = ProcAlt(softc, LINK1IN, NULL);
else
	index = 1-ProcAlt(LINK1IN, softc, NULL);
Toggle = 1-Toggle;

The goal of this if-statement is to insure fairness among the two sources of packets, and to alternatively test softc first, then LINK1IN first, so that in case packets arrive constantly on both channels, then the source connected to the other won't starve.

Assuming a chain of N transputers, and assuming that each one generates P packets of pixels (this may not be exactly true in our implementation as packets containing only black pixels are not transferred by the Compute task), how many packets are processed by the Root? By Node 2? By Node 3? By Node N?

In view of your answer, how do you explain that the Root finishes displaying its columns first?

.


 

 
8-2
Consider the following algorithm for sorting N numbers on a linear chain of N processors. Each processor holds a buffer where one of the N items will be stored. As usual we assume that the Root is at the left of the chain. The Root reads the numbers one at a time from a file and processes them according to the following algorithm:

;N Processors

;Input: N random numbers

;Output: N items sorted in increasing order

S0: Get first number and put it in the buffer.

S1: Get the next number.

S2: If it is smaller than the number currently in the buffer then store that number in the buffer, and send the displaced number to the transputer on the right.

S3: If the number is larger than the one currently in the buffer, then pass it on to the transputer on the right.

S4: If less than N numbers have been read, go back to S1.

S5: Dump the number in the buffer to the host, and request N-1 numbers stored in the transputers on the right. Dump each one to the host as they arrive.

The numbers are returned to the host in increasing order

The other N-1 transputers follow the same steps, with slight modifications in S0, S4, S5, and S6. Give the description of the algorithm followed by Transputer Ti.

What is the complexity of this algorithm? What maximal speedup can we expect?

Adapt the algorithm so that a chain of P transputers can sort N numbers, with N>P. Assume that N is divisible evenly by P.

Implement this algorithm on a chain of P transputers, and compute the speedup it offers for various values of N (don't hesitate to sort huge amounts of numbers!).


 





[Previous] [HOME] [NEXT]