| 6-7 |
Looking at the code above, list several modifications that could make the computation more efficient and the execution time smaller.
. |
Now, the main reason for us to choose and create this example is that it decomposes simply into two tasks, one in charge of doing the real work, Compute, and one, Dispatch, in charge of hauling the information back and forth between the producers and the consumers.
Before we ponder on which priority is better suited for the tasks, let's run several experiments where we assign all possible combinations of priorities to the tasks. Since we have two tasks and two priorities, four experiments are necessary. Each one simply requires the modification of two lines in the code, as shown in Table 6-1:
Compute Dispatch LS C Code
Experiment 1 Low Low ProcPar(DispatchP, ComputeP, 0);
Experiment 2 Low High ProcRunHigh(DispatchP);
ProcPar(ComputeP, 0);
Experiment 3 High Low ProcRunHigh(ComputeP);
ProcPar(DispatchP, 0);
Experiment 4 High High ProcToHigh(); ProcPar(DispatchP,
ComputeP, 0); ProcToLow();
Table
6-
1:
Priority assignment for experiments, and code changes.

Figure 6-5 shows the result of running the four experiments, each one testing 600,000 integers with two T805 transputers. The height of the bars represents the total execution time expressed in seconds, while each bar is divided into two regions, showing the percentage of the total number of primes contributed by each node. For example, the first bar corresponds to the case where both the Compute and Dispatch tasks operates at low priority. Running the program, we measure a total execution time of 121.76 seconds. But, surprisingly, we also find that Node 2 tested only 134,248 integers, whereas Node 1 tested 464,752 integers, or 77% of the original 600,000 integers.
Let's follow the chain of actions occurring when Node 2 has just analyzed an integer and is ready to send the result to the Dispatch task running on the Root. Just before Timer Event 2, The Compute task in Node 2 "chans out" its number to the Dispatch task. This is represented by the arrow going from the third line to the fourth one. As soon as this occurs, Compute blocks and becomes inactive. Dispatch then initiates a ChanOut to send that information to the Root. In this example we assume that the ChanOut is issued just after Quantum Event 2. We selected this case because it is the worst case scenario and illustrates well the source of inefficiency in the all low-priority scheme. The Dispatch task on the Root node is blocked on a ProcAlt. It will awake only on a Timer Event, and only if one of the channels that it is monitoring is ready to send information. For this reason, the Dispatch task on the Root will only be able to accept the information sent by the Dispatch task of Node 2 only at Quantum Event 3. At this point in time the Compute task on Node 1 becomes inactive since the Dispatch task is unblocked from its ProcAlt. The Dispatch task gets the boolean, stores it in its array and sends back a new integer to test.
During all this time both the Dispatch and Compute tasks of Node 2 are inactive. Both are blocked on a ChanIn. Dispatch on a ChanIn from Node 1, and Compute on a ChanIn from Dispatch. Finally when the Dispatch of Node 1 sends the new integer, Node 2's Dispatch awakes, gets the integer and "chans it out" to its Compute partner which becomes active.
What should we be worried about in this picture? You will probably have guessed that because we selected the case where Node 2 sends it message right after Quantum Event 2 for Node 1, then Node 1 will not be able to respond until the next quantum event. Node 2 therefore must wait for almost a full low priority quantum, or 64 s, before Node 1 can service its request. This is the worst case, of course, but we can expect that on the average, Node 2 will have to experience approximately half a quantum delay when it sends its individual results back.
The broken arrow linking the ChanOut of the Dispatch task on Node 2 to the ChanIn of the Dispatch task on Node 1 contains the source of inefficiency of this scheme.
This explains why Node 1 will treat about half as many integers more than Node 2 in Experiment 1 than in Experiment 3, where they both process the same amount of integers.
Figure 6-5 shows that making all tasks high priority is as inefficient as making them both low priority. This is logical, since for Dispatch on Node 1 to respond right away to Node 2 it must preempt its Compute partner, and having Compute running at high priority prevents this from happening. Of course, when Compute is assigned higher priority than Dispatch, this certainly cannot happen either.
Priority Rule: | When multitasking is implemented, always assign high priority to the tasks that manage the transfer or buffering of information, and low priority to the tasks producing or consuming the information. This will result in shorter communication delays, improved response time, and better overall performance. |
| 6-8 |
Remove the Dispatch task from Node 2 and make Compute interface directly with the hard channel, and measure the difference in execution time. Before you start run the original program and time it. Then, knowing that the transputer can switch between tasks in about 1 s, estimate what the largest change in execution time you should expect. Modify the program by removing Dispatch from Node 2 and interfacing its Compute task to LINK1, and time the execution. Is the new execution time within your estimated interval? . |
| 6-9 |
Let's investigate the effects of computation granularity. This problem
is to find the optimum size of the information exchanged between the nodes so
that the execution time is minimized. Our original program exchanges only one
number at a time. Modify it so that blocks of numbers are exchanged at a time.
Time your modified program with blocks ranging from 1, 2, ...up to 1024
integers. Plot the execution time as a function of block size. Explain the
shape of the curve. What is the block size that minimizes the execution time?
Hints: When Compute has tested all the numbers in a block, it needs to return an array containing the booleans associated with each numbers in the block it received. You may speed up your program by sending back a bit-map, or an array of char rather than an array of integers. Also, when Dispatch on the Root sends blocks of integers to the Compute tasks, it does not need to send arrays at all. If you define the size of the array as a macro accessible to both nodes, then just sending one integer, the first one in a block, is sufficient for the Compute tasks to do their job. They will know the starting integer and the number of integers to test.
|
| 6-10 |
Using Figures 6-6 and 6-7 as examples, draw the timing diagram of the
information exchange between Node 2 and Node 1 when both the Compute and
Dispatch tasks run at high priority. Where is the source of inefficiency in
this case?
|
The easy solution in this case is to restrict all I/O operations to one task only. But in some cases, such as during the debugging a program it is often useful to issue printf statements from several parallel tasks to trace their execution. It is then imperative to protect the I/O operations. A way to do this is to create a safe printf function, as shown below:
#include <stdarg.h>
void safeprintf(char *format,...)
{
va_list argptr;
int CurrentPrio; /* keep track of current priority */
va_start(argptr, format);
CurrentPrio = ProcToHigh();
vprintf(format, argptr);
if (CurrentPrio==1) /* bring task back to original priority */
ProcToLow();
va_end(argptr);
}
With this new function, and as long as the string printed is less than 80 characters in length, parallel tasks can safely print information concurrently and be sure that no conflict will occur.
safeprintf("Node %d: Dispatch received x=%d\n",_logical_node_number, x);
...
safeprintf("Node %d: Compute sending back %d\n",_logical_node_number,
x=IsPrime(x));
Note that this technique is not fool-proof! Although it looks like the vprintf statement is executed by only one task at a time, since in high priority it cannot be preempted, there still exists a potential problem. The vprintf statement sends a string to the host, and for this reason will "chan out" a packet to the host, and will automatically block during the transfer. This blocking will allow any other high or low priority tasks to take over the processor and issue their own safeprintf. However, since the new task starts after vprintf has had time to create the message, the interaction will be safe.
Another alternative is to associate a semaphore with the call to vprintf. This solution is less efficient since the transputer does not support semaphores in hardware, and the SemP and SemV functions must create their own list of processes spinning on the semaphore. However, its is recommended when the messages passed to vprintf are larger than 80 characters, as printf packages messages to print in packets of 80 bytes.
This technique of switching to high priority can be used in situations similar to the one presented here and not only for Input/Output operations.
| 6-11 |
Modify the prime finding program used in this chapter so that it generates a trace of the activity of the Compute and Dispatch tasks on Node 1. Make it such that the tasks display the following information: - Rl0 The relay task just received the boolean 0 from the local Compute task. - Rr0 The relay task just received the boolean 0 from the remote Compute task - Rl1 The relay task just received the boolean 1 from the local Compute task. - Rr1 The relay task just received the boolean 1 from the remote Compute task. - Cx The Compute task just received x from Dispatch. Implement the printf statement with high priority. . |
| 6-12 |
Modify the prime finding program so that it is comprised of two
concurrent Compute tasks running on only one transputer, and executing the
same code:
void Compute(Process *P, int which)
{
int x, result;
while (next < MAXDIM)
Table[next++] = IsPrime(next);
}
They both share the integer next and the array Table, and alternate one after
the other, finding all the primes between 1 and MAXDIM-1. Run the program,
making sure both Compute tasks are launched at low priority (ProcPar). When
the two task have finished, have main check the contents of the array and
verify that each entry in the table correctly indicates whether its associated
index is prime or not. Make your program report any incorrect information
found in Table.When you run the program with MAXDIM set to 1000, you will find that, indeed, the two tasks have computed many primes incorrectly. Explain why?
You will find that the problem is in the way Compute is written. Would modifying the while loop as in the following code improve or even correct the problem?
while (next < MAXDIM)
{
result = IsPrime(next++);
Table[next-1] = result;
}
How
about the following loop? (Be careful about this one!)
while (next < MAXDIM)
{
x = next;
Table[x] = IsPrime(x);
next++;
}
Can you come up with a safe and easy (different) solution?
|
Some steps can be taken to simplify debugging. One of them is to map the whole application to a single transputer. With this approach a tighter control of the execution of the different tasks can be achieved. The map.h module that we introduced is an example of tools that the programmer can build for him- or herself.
When several tasks run concurrently on the same node, it is important to assign the priority levels carefully. The golden rule will be to assign high priority to the tasks in charge of moving the information around, so that the ones actually involved with the computing will experience shorter delays when exchanging messages.
In the next chapter we look at deadlocks, which is one of the most common problems we will have to face. It is a well understood problem, and we will see that with a disciplined approach to programming, we will be able to program deadlock-free applications.