Parallel Programming in C for the Transputer
© D. Thiébaut, 1995




EXERCISES


 
6-1

Modify the prime finding program used in this section to illustrate the inner workings of Parascope, so that it runs on two transputers. The Print task will run on Transputer 1, the Root, while the Compute Task will reside on the second transputer. This time the two tasks will communicate over a hard channel. Run the same experiment implemented in this section, and stop the output of primes by pressing Ctrl-C before the last prime is printed. Then use Parascope (or your debugger) to find the status of the Compute Task residing on Transputer 2. In particular, display the contents of its local variables.

.


 

 
6-2
When starting several tasks with ProcPar(Task1P, Task2P...TasknP), which task starts first? Task1 or Taskn? To find out, write a one-transputer program that will start three tasks. Write the tasks in such a way that you can safely abort the cio server once the program is started and analyze its status with Parascope to find out which Task started first. (Be careful about using a printf statement. Having several parallel tasks use the same resource (the link to the Host) at the same time will likely result in the program crashing!)


 


6-1 Defensive Debugging Techniques

In the previous section we were interested in repairing the damages caused by programming bugs. Even though parallel debuggers can navigate through transputers and give us a glimpse of the state of the global computation, it is paramount to take several steps in the development of a parallel application so as to ease the debugging process and to reach the goal faster. In this section we study this approach, which we call defensive debugging.

Defensive debugging means that we should start coding in a manner that will ease our debugging effort. In fact, we should start thinking about debugging before we start coding at all. Given a machine structure where the different parts of the program reside on different processors using hard channels to communicate, and where only one of them, the Root, is connected to the host computer, it becomes very difficult to follow the execution of any code segment that does not reside in the Root. For example, examine the following simple code section.


main()
{
     int A;
     if (_node_number==1)     
     {
          ...
          ChanIn(LINK0IN, &A, INTSIZE);
          printf("Root: A = %d\n",A);
          ...
     }
     else if (_node_number==2)
     {
          ...
          printf("Node 2: A = %d\n",A);
          ...
     }
}

The two different code sections are running on two transputers, and both are using a variable named A. At a particular point in the computation, we want to check the contents of A in each task. Why not use a printf statement? It looks simple and should work. Unfortunately, it will not. Can we see why? Indeed, the printf statement of the second processor, Transputer 2, cannot execute correctly because all traffic is handled by the virtual channel router and the ChanIn and ChanOut operations will corrupt that sent by printf and will relay it directly to cio running on the host. However tempting, the implementation of such a relay task is not trivial, for the following reasons:

These are requirements that require serious thinking and extensive programming. Definitely not worth the time and effort when all we want is to find the status of a variable.

Root-Debugging

Hence the Root is definitely a privileged processor in the parallel machine, and we will monopolize on its properties. Our first approach to defensive programming, which we coin Root-Debugging, will be to map the whole program onto the root transputer. Although the idea is simple, its implementation will require careful planning, so that we can easily switch the application from 1 to several transputers and back, a cycle that may have to be repeated several times.

Debugging Communication

The second means at our disposition is to design a robust communication network. Logical Systems C provides a number of communication primitives with fail-safe mechanisms. Knowing how to use them well will allow us to detect conditions of starvation or of dead-lock in the network. The areas of dead-lock prevention, dead-lock detection, dead-lock avoidance are rich of on-going research. One reason for this abundance of interest is that dead-locks and starvation conditions have a propensity to appear uninvited even in carefully designed communication structures.

6-2 Root Debugging

Ideally, root debugging should always be the first step when writing a complex application from scratch. It is an essential part of the programming effort, and its implementation in no ways alters the top-down decomposition process involved in the writing of the program. It simply influences the "cosmetic" aspect of the parallel program, but not its structure.

As we discussed earlier, the idea behind root debugging is to first develop the parallel program so that it runs on the root transputer only. Three fundamental properties of the transputer allow us to take such an approach:

Multitasking allows us to have the root transputer run in parallel the tasks that normally would run on the different nodes of our network. Of course, we loose the simultaneous execution of these tasks. Once on the transputer, they can run only one at a time, following the transputer scheduling algorithm.

The second, more elusive property, is that the transputer treats hard and soft channels identically in C[1]. The difference between the two types of channels only appears at the microcode level. This gives the transputer invaluable flexibility in dealing with channels. The result is that all functions accepting channels as parameters will work equally well with hard channels as with soft channels. For example, assume that Ti and Tj are two tasks intended to run on two neighboring transputers. Assume furthermore that Ti and Tj exchange an integer in the following fashion:


{---- Task Ti, running on _node_number i ----}
void Taski(Process *P,  some argument definitions )
{
     ...
     ChanOut(LINK0OUT, &i, (int) sizeof(int));
     ...
}

{---Task Tj, running on _node_number j ---}
void Taskj(Process *P, some argument definitions )
{
     ...
     ChanOut(LINK1IN, &j, (int) sizeof(int));
     ...
}
Then we can easily make these two tasks run on the same transputer and "talk" over a soft channel. The only difference in the code above will be in the channels used.

/*---- Task Ti, running on _node_number i ----*/
void Taski(Process *P,  Channel *softc, some argument definitions )
{
     ...
     ChanOut(softc, &i, (int) sizeof(int));
     ...
}

/*---Task Tj, running on _node_number j ---*/
void Taskj(Process *P, Channel *softc,  some argument definitions )
{
     ...
     ChanOut(softc, &j, (int) sizeof(int));
     ...
}

Note that the code is exactly the same, except for the additional soft channel parameter that is passed to both tasks. We will have to make sure that the same channel is passed to both tasks initially.

Are the two properties seen so far sufficient for us to ensure that an application will run identically on one as well as on N transputers? They are sufficient to ensure that the tasks can run and exchange information on one transputer. Will the time-dependency of events, in particular the scheduling of data exchange be maintained? Yes, because communication over channels is blocking and unbuffered. This third property, forcing channel communication to block until both sender and receiver are ready to participate in the exchange, insures that communication events will occur in the same relative order when the tasks run on one transputer, as when the same tasks run on N transputers. These three properties are thus the key to root-debugging.


Figure 6-2: By modifying only the map module, the whole application which is a collection of four tasks and a map module can run on four transputers or on one transputer only.

Performance issue

The mapping of a full size network to only one transputer is bound to reduce the performance of the program. Although the program incorporates a multitasking component, in its root-debugging stage it is only as good as a serial program.

Mapping strategy

Careful book-keeping is required when we setup the 1-transputer version of our program, so that moving to the multi-transputer version can be accomplished easily. One of the golden rules of good programming practice learnt in introductory programming courses stresses that programs must be modular. We will therefore pay close attention to designing our application so that the details of the mapping are encapsulated. Our goal is shown in Figure 6-1.

A two-node example

Let's work out a simple example first. It requires two tasks: A producer, producing an integer, and a consumer, getting this integer and printing it. Here is the algorithm:

Network: chain

Host Node 1 Node 2

Links:

Node 1: LINK0 HOST
LINK1 Node 2

Node 2: LINK1 Node 1

Computation:

Node 1:
print message
get integer from Node 2
print integer

Node 2:
Send integer to Node 1

The simplicity of this application will be balanced by the complexity of the mapping module, and will allow us to better grasp its mechanics. You probably noticed that the algorithm specifies nodes, and not transputers. This is because we must make the application independent of the physical network running it. Whether the application is running on one or on two transputers, it will still implement a structure containing two nodes communicating with each other.

Let's define the two tasks running on Node 1 and on Node 2.

int Task1( arguments to be defined later )
{
int i;

printf("Task 1 started\n");
ChanIn(LINK1IN, &i, (int) sizeof(i));
printf("received %d\n",i);
}

int Task2( arguments to be defined later )
{
int i = 105;

ChanOut(LINK0OUT, &i, (int) sizeof(i));
}
You will notice that they use LINK1IN and LINK0OUT, as would have been expected if the two tasks had been intended to run on two neighboring transputers. The example of the previous section showed that the tasks must communicate via soft channels if we want to be able to map them to one transputer. In such a case, should we use LINK0OUT and LINK1IN? To answer this question, we simply need to look up the definitions of these terms in the conc.h header file:

#define LINK0OUT ((Channel *)(NOPROCESS | (_WORD_SIZE * 0)))
#define LINK1OUT ((Channel *)(NOPROCESS | (_WORD_SIZE * 1)))
#define LINK2OUT ((Channel *)(NOPROCESS | (_WORD_SIZE * 2)))
#define LINK3OUT ((Channel *)(NOPROCESS | (_WORD_SIZE * 3)))
#define LINK0IN  ((Channel *)(NOPROCESS | (_WORD_SIZE * 4)))
#define LINK1IN  ((Channel *)(NOPROCESS | (_WORD_SIZE * 5)))
#define LINK2IN  ((Channel *)(NOPROCESS | (_WORD_SIZE * 6)))
#define LINK3IN  ((Channel *)(NOPROCESS | (_WORD_SIZE * 7)))

Links are macros

We observe that the Links are simple macros representing memory addresses. As such, they can be redefined very easily. Keeping in mind that we should strive for modularity, we now create a new header file, map.h, which contains all the details of the mapping of our application. Because map.h is relatively complex, we explore each of its components first in isolation. The complete listing of map.h is shown at the end of this section.

/*
   map.h
   Remapping the hard channels...
*/
#ifndef map_h
#define map_h
#include "conc.h"

#define  RUN_ON_ROOT    /*define to map all tasks to root*/
#define  NO_XPUTERS 2       

   #ifdef   RUN_ON_ROOT    
	Channel  *Link0In[NO_XPUTERS+1];  
	Channel  *Link0Out[NO_XPUTERS+1]; 
	Channel  *Link1In[NO_XPUTERS+1];  
	Channel  *Link1Out[NO_XPUTERS+1];

	#undef  LINK0IN
	#define LINK0IN   Link0In[_logical_node_number]
	#undef  LINK0OUT
	#define LINK0OUT  Link0Out[_logical_node_number]
	#undef  LINK1IN
	#define LINK1IN   Link1In[_logical_node_number]
	#undef  LINK1OUT
	#define LINK1OUT  Link1Out[_logical_node_number]
   #endif
#endif

The macro RUN_ON_ROOT controls how the mapping is done. If RUN_ON_ROOT is defined, then the whole application is mapped to the root. If it is not defined, then the application is allowed to run in its originally intended mode, using as many transputers as are required.

If RUN_ON_ROOT is defined, then the LINK macros refer to array elements. The idea here is to associate with each node a series of eight soft channels acting as the hard channels, and whose names are closely associated with the hard channels they replace. The logical number of the node is used as an index to select the channel. This selection will allow us to very easily "attach" nodes to each other via the soft channels. But before we do so, let's come back to these channel arrays, and observe how the ChanOut statement of our example translates, depending on whether we define RUN_ON_ROOT or not. Here is the original statement:


ChanOut(LINK0OUT, &i, (int) sizeof(i));
When RUN_ON_ROOT is not defined, LINK0OUT refers to the hard channel, and the statement is unchanged by the preprocessor. When the RUN_ON_ROOT is defined, however, the preprocessor pp[2] will change the statement as follows:

ChanOut(Link0Out[_logical_node_number],  &i, (int) sizeof(i));

In this case, Task2, which is sending the integer i to Task1, will use a soft channel in the array Link0Out to communicate with Task1. If we take a look at the corresponding ChanIn statement in Task1, we will find that pp will have transformed it into:


	ChanIn(LINK1IN, &i, (int) sizeof(i));
You probably have caught on now with the role of _logical_node_number. Its function is similar to that of _node_number, but because our task may now be forced to run on the root, _node_number cannot be used anymore to make the code aware of the node it is working on. The solution is to pass to each parallel task a variable identifying the node number with which it is associated. This is what _logical_node_number does. Task1 will be passed the number 1. Task2, the number 2. This way, independently of the mapping, Task1 and Task2 will be know that, logically, they are working for Node 1 and Node 2, respectively.

Can they communicate?

What is missing in the above mapping for our example program to run when mapped to the root? The missing link is a channel! Right now we have Task2 send its integer via a channel that it gets from an array Link0Out[], and Task1 receive the integer via a channel obtained from Link1In[]. But we have neither defined a soft channel, nor have we assigned it to the two array elements. This is our next step.

int InitializeMapping()
{
    #ifdef RUN_ON_ROOT
    int i;

  
    /*--- use hard channels between Root and host ---*/
    Link0In[1] = LINK0IN;
    Link0Out[1]= LINK0OUT;
    AbortIf(!(Link1In[1]=ChanAlloc()),"Out of memory");
    AbortIf(!(Link1Out[1]=ChanAlloc()),"Out of memory");
    
    /*--- others nodes use soft channels ---*/
    for (i=2; i<NO_XPUTERS; i++)
    {
        Link0In[i]  = Link1Out[i-1];
        Link0Out[i] = Link1In[i-1];
        AbortIf(!(Link1In[i] =ChanAlloc()),"Out of memory");
        AbortIf(!(Link1Out[i]=ChanAlloc()),"Out of memory");
    }

    /*--- Last node uses hard channels too ---*/
    Link0In[NO_XPUTERS]  = Link1Out[NO_XPUTERS-1];
    Link0Out[NO_XPUTERS] = Link1In[NO_XPUTERS-1];
    Link1In[NO_XPUTERS]  = LINK1IN;
    Link1Out[NO_XPUTERS] = LINK1OUT;
    #endif

    return 0;
}

The function abortif, whose code is in the full listing of map.h simply exits if the condition it is testing is NULL. The function InitializeMapping is also part of the map.h file, and creates soft channels that it assigns to the Link0In, Link0Out, Link1In, and Link1Out arrays in order to emulate a chain of transputers. Figure 6-2 illustrates how this operation works. Observe that the physical layout used to connect the nodes influences directly how the arrays are initialized. The array elements here are represented separated from each other, and array cells with identical fill patterns contain identical channel pointers.

The for (i=2; i< NO_XPUTERS; i++) loop is a direct translation of this assignment of matching patterns in the figure. The code directly preceding and directly following the loop ensures that the first Node in the chain uses real hard channels to connect the first node of the chain to the Host, and, by the same token, makes sure the last Node in the chain uses hard channels for the "ending" links of the chain. Although this may seem bizarre, this approach simply reinforces the symmetry of our emulation, and allows this method to be used more broadly, when only parts of the nodes are mapped to the root, and when other applications, possibly fully debugged, are running on the other transputers of the chain.




[Previous] [HOME] [NEXT]