Parallel Programming in C for the Transputer
© D. Thiébaut, 1995




EXERCISES


 
6-7

Looking at the code above, list several modifications that could make the computation more efficient and the execution time smaller.

.


 


Now, the main reason for us to choose and create this example is that it decomposes simply into two tasks, one in charge of doing the real work, Compute, and one, Dispatch, in charge of hauling the information back and forth between the producers and the consumers.

Before we ponder on which priority is better suited for the tasks, let's run several experiments where we assign all possible combinations of priorities to the tasks. Since we have two tasks and two priorities, four experiments are necessary. Each one simply requires the modification of two lines in the code, as shown in Table 6-1:

                    Compute     Dispatch               LS C Code

Experiment 1          Low         Low      ProcPar(DispatchP, ComputeP, 0);

Experiment 2          Low         High     ProcRunHigh(DispatchP);
                                           ProcPar(ComputeP, 0);

Experiment 3         High         Low      ProcRunHigh(ComputeP);
                                           ProcPar(DispatchP, 0);

Experiment 4         High         High     ProcToHigh(); ProcPar(DispatchP,
                                           ComputeP, 0); ProcToLow();

Table 6- 1: Priority assignment for experiments, and code changes.


Figure 6- 5: Execution times of prime-finding application on two transputers, with different combinations of priority levels assigned to the Dispatch and Compute tasks.

Figure 6-5 shows the result of running the four experiments, each one testing 600,000 integers with two T805 transputers. The height of the bars represents the total execution time expressed in seconds, while each bar is divided into two regions, showing the percentage of the total number of primes contributed by each node. For example, the first bar corresponds to the case where both the Compute and Dispatch tasks operates at low priority. Running the program, we measure a total execution time of 121.76 seconds. But, surprisingly, we also find that Node 2 tested only 134,248 integers, whereas Node 1 tested 464,752 integers, or 77% of the original 600,000 integers.

Surprising winner?

The clear winner is Experiment 3 where Dispatch is given high priority and Compute low priority. Isn't it contradictory, though, to demote Compute which, after all, does the only useful work in the application, with respect to Dispatch which only moves the information around? At first glance it may seems so. But the logic here is to give Dispatch high priority so that Compute will have to wait as little as possible to get its next integer.

Low-Low priority case

To better understand this, let's take a look at what happens when both tasks are launched at low priority. The timing diagram shown in Figure 6-6 illustrates the problem. The top numbers represent quantum events associated with the low priority tasks, and defining task switching instants. If a task is running and other low priority tasks are active, then, at the next quantum event the running task is put back on the queue of active tasks, and the next active task is run. If no other active tasks exist, then the running tasks keep going.

Let's follow the chain of actions occurring when Node 2 has just analyzed an integer and is ready to send the result to the Dispatch task running on the Root. Just before Timer Event 2, The Compute task in Node 2 "chans out" its number to the Dispatch task. This is represented by the arrow going from the third line to the fourth one. As soon as this occurs, Compute blocks and becomes inactive. Dispatch then initiates a ChanOut to send that information to the Root. In this example we assume that the ChanOut is issued just after Quantum Event 2. We selected this case because it is the worst case scenario and illustrates well the source of inefficiency in the all low-priority scheme. The Dispatch task on the Root node is blocked on a ProcAlt. It will awake only on a Timer Event, and only if one of the channels that it is monitoring is ready to send information. For this reason, the Dispatch task on the Root will only be able to accept the information sent by the Dispatch task of Node 2 only at Quantum Event 3. At this point in time the Compute task on Node 1 becomes inactive since the Dispatch task is unblocked from its ProcAlt. The Dispatch task gets the boolean, stores it in its array and sends back a new integer to test.

During all this time both the Dispatch and Compute tasks of Node 2 are inactive. Both are blocked on a ChanIn. Dispatch on a ChanIn from Node 1, and Compute on a ChanIn from Dispatch. Finally when the Dispatch of Node 1 sends the new integer, Node 2's Dispatch awakes, gets the integer and "chans it out" to its Compute partner which becomes active.

What should we be worried about in this picture? You will probably have guessed that because we selected the case where Node 2 sends it message right after Quantum Event 2 for Node 1, then Node 1 will not be able to respond until the next quantum event. Node 2 therefore must wait for almost a full low priority quantum, or 64 s, before Node 1 can service its request. This is the worst case, of course, but we can expect that on the average, Node 2 will have to experience approximately half a quantum delay when it sends its individual results back.

The broken arrow linking the ChanOut of the Dispatch task on Node 2 to the ChanIn of the Dispatch task on Node 1 contains the source of inefficiency of this scheme.

Figure 6- 6: Timing of exchange of data between two transputers when Dispatch and Compute tasks run at low priority.

Clearly the Dispatch task does not respond fast enough. The solution is to make it high priority. This results in a timing diagram similar to that of the preceding example, but one that does not contain a broken arrow indicating a delay during the communication between Node 2 and Node 1. This is shown in Figure 6-7. As soon as Dispatch on Node 2 chans out its result to Node 1, Dispatch on that node is unblocked since it has now a high priority, and starts receiving the information. Focus on the amount of time the Compute tasks are inactive. The Compute task on Node 1 suffers about the same amount of inactivity in both cases, but the one on Node 2 definitely suffers more in the first experiment. The effect of running Dispatch at low priority affects the two nodes differently. Node 1 is not affected, while Node 2 suffers a delay every time it sends a result back to Node 1.

This explains why Node 1 will treat about half as many integers more than Node 2 in Experiment 1 than in Experiment 3, where they both process the same amount of integers.

Figure 6- 7: Timing of exchange of data between two transputers when Dispatch is launched at high priority, and Dispatch at low priority.

Figure 6-5 shows that making all tasks high priority is as inefficient as making them both low priority. This is logical, since for Dispatch on Node 1 to respond right away to Node 2 it must preempt its Compute partner, and having Compute running at high priority prevents this from happening. Of course, when Compute is assigned higher priority than Dispatch, this certainly cannot happen either.

Priority Rule:
Communication is high priority

When multitasking is implemented, always assign high priority to the tasks that manage the transfer or buffering of information, and low priority to the tasks producing or consuming the information. This will result in shorter communication delays, improved response time, and better overall performance.


EXERCISES


 
6-8

Remove the Dispatch task from Node 2 and make Compute interface directly with the hard channel, and measure the difference in execution time. Before you start run the original program and time it. Then, knowing that the transputer can switch between tasks in about 1 s, estimate what the largest change in execution time you should expect. Modify the program by removing Dispatch from Node 2 and interfacing its Compute task to LINK1, and time the execution. Is the new execution time within your estimated interval? .


 

 
6-9
Let's investigate the effects of computation granularity. This problem is to find the optimum size of the information exchanged between the nodes so that the execution time is minimized. Our original program exchanges only one number at a time. Modify it so that blocks of numbers are exchanged at a time. Time your modified program with blocks ranging from 1, 2, ...up to 1024 integers. Plot the execution time as a function of block size. Explain the shape of the curve. What is the block size that minimizes the execution time?

Hints: When Compute has tested all the numbers in a block, it needs to return an array containing the booleans associated with each numbers in the block it received. You may speed up your program by sending back a bit-map, or an array of char rather than an array of integers. Also, when Dispatch on the Root sends blocks of integers to the Compute tasks, it does not need to send arrays at all. If you define the size of the array as a macro accessible to both nodes, then just sending one integer, the first one in a block, is sufficient for the Compute tasks to do their job. They will know the starting integer and the number of integers to test.


 

 
6-10
Using Figures 6-6 and 6-7 as examples, draw the timing diagram of the information exchange between Node 2 and Node 1 when both the Compute and Dispatch tasks run at high priority. Where is the source of inefficiency in this case?


 


Going High Priority to secure access to shared information

There are also instances when dynamically switching to high priority may be useful for a task. This is usually done for short actions that can be done quickly, and that will not hog up the processor. A good example is when a task requires priviledged access to shared information or shared code. Logical Systems C, for instance, did not protect the standard I/O functions by critical sections, and if two low priority parallel tasks running on a transputer decide to execute a printf statement concurrently, it was possible for the I/O operations to collide, which usually caused the program to crash. The problem is that one task may start the printf process, and partly initialize data structures internal to the printf low-level routines when its quantum expires, leaving the other task to initiate its call to printf, and clobber the internal data structures. When the first task resumes its operation, it will have lost its state and unpredictable results will occur.

The easy solution in this case is to restrict all I/O operations to one task only. But in some cases, such as during the debugging a program it is often useful to issue printf statements from several parallel tasks to trace their execution. It is then imperative to protect the I/O operations. A way to do this is to create a safe printf function, as shown below:

#include <stdarg.h>
void safeprintf(char *format,...)
{
      va_list argptr;
     int CurrentPrio;      /* keep track of current priority */

     va_start(argptr, format);
     CurrentPrio = ProcToHigh();
     vprintf(format, argptr);
     if (CurrentPrio==1)     /* bring task back to original priority */
          ProcToLow();
     va_end(argptr);
}

With this new function, and as long as the string printed is less than 80 characters in length, parallel tasks can safely print information concurrently and be sure that no conflict will occur.

safeprintf("Node %d: Dispatch received x=%d\n",_logical_node_number, x);
...
safeprintf("Node %d: Compute sending back %d\n",_logical_node_number,
x=IsPrime(x));

Note that this technique is not fool-proof! Although it looks like the vprintf statement is executed by only one task at a time, since in high priority it cannot be preempted, there still exists a potential problem. The vprintf statement sends a string to the host, and for this reason will "chan out" a packet to the host, and will automatically block during the transfer. This blocking will allow any other high or low priority tasks to take over the processor and issue their own safeprintf. However, since the new task starts after vprintf has had time to create the message, the interaction will be safe.

Another alternative is to associate a semaphore with the call to vprintf. This solution is less efficient since the transputer does not support semaphores in hardware, and the SemP and SemV functions must create their own list of processes spinning on the semaphore. However, its is recommended when the messages passed to vprintf are larger than 80 characters, as printf packages messages to print in packets of 80 bytes.

This technique of switching to high priority can be used in situations similar to the one presented here and not only for Input/Output operations.


EXERCISES


 
6-11

Modify the prime finding program used in this chapter so that it generates a trace of the activity of the Compute and Dispatch tasks on Node 1. Make it such that the tasks display the following information:

- Rl0 The relay task just received the boolean 0 from the local Compute task.

- Rr0 The relay task just received the boolean 0 from the remote Compute task

- Rl1 The relay task just received the boolean 1 from the local Compute task.

- Rr1 The relay task just received the boolean 1 from the remote Compute task.

- Cx The Compute task just received x from Dispatch.

Implement the printf statement with high priority. .


 

 
6-12
Modify the prime finding program so that it is comprised of two concurrent Compute tasks running on only one transputer, and executing the same code:
void Compute(Process *P, int which)
{
        int x, result;

        while (next < MAXDIM)
               Table[next++] = IsPrime(next);
}
They both share the integer next and the array Table, and alternate one after the other, finding all the primes between 1 and MAXDIM-1. Run the program, making sure both Compute tasks are launched at low priority (ProcPar). When the two task have finished, have main check the contents of the array and verify that each entry in the table correctly indicates whether its associated index is prime or not. Make your program report any incorrect information found in Table.

When you run the program with MAXDIM set to 1000, you will find that, indeed, the two tasks have computed many primes incorrectly. Explain why?

You will find that the problem is in the way Compute is written. Would modifying the while loop as in the following code improve or even correct the problem?

        while (next < MAXDIM)
        {
               result = IsPrime(next++);
               Table[next-1] = result;
        }

How about the following loop? (Be careful about this one!)

        while (next < MAXDIM)
        {
               x = next;
               Table[x] = IsPrime(x);
               next++;
        }
Can you come up with a safe and easy (different) solution?

 


6-6 Concluding Remarks

Debuggers are powerful tools that each programmer should have available and should know well. Even though sophisticated debuggers are available to analyze parallel programs, their efficiency comes in play with experience and practice. There are no well defined rules for approaching debugging. The programmer knows the program and what it is supposed to do, and the debugger can be used to capture glimpses of the parallel execution of the program. But, as with all physical experiments, the mere introduction of the debugger in the parallel application modifies its behavior in non deterministic ways.

Some steps can be taken to simplify debugging. One of them is to map the whole application to a single transputer. With this approach a tighter control of the execution of the different tasks can be achieved. The map.h module that we introduced is an example of tools that the programmer can build for him- or herself.

When several tasks run concurrently on the same node, it is important to assign the priority levels carefully. The golden rule will be to assign high priority to the tasks in charge of moving the information around, so that the ones actually involved with the computing will experience shorter delays when exchanging messages.

In the next chapter we look at deadlocks, which is one of the most common problems we will have to face. It is a well understood problem, and we will see that with a disciplined approach to programming, we will be able to program deadlock-free applications.



[Previous] [HOME] [NEXT]