Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



4
Communication Primitives

In this chapter we write our first multi-transputer on the simple network topology offered by a linear chain. We will do this for two reasons. First, a chain is compatible with all transputers, and second, the chain is sufficiently simple that it will allow us to concentrate on the task-level parallelism rather than on the network parallelism.

We will cover each new programming concept in detail, looking at examples and building illustrative programs at every step.

We will start with transputer-to-transputer communication, examining how programs running on neighboring transputers exchange information. This will allow us to divide our application among several transputers.

The next step will be to look at how parallelism can be implemented within a transputer, and to create parallel tasks running on a single transputer.

Once we know how to control inter-processor parallelism, as well as internal multitasking, we will combine the two and build a general framework for parallel programs.


Figure 4 -1: The transputer chain. Notation and convention.

4-1 The network topology

Our next program will run on a chain of transputers. This configuration is just as easily implemented with single transputer boards as it is with more sophisticated multiprocessor systems. The chain is also an important network configuration to study for systems based on T400 transputers which have only two operational links, and because it will introduce us to the discipline of carefully planning and laying out our application on a physical network.

To program an application on any network, it is essential to know the exact details of the interconnect linking the transputers to each other. Since each transputer has two potential links it can use to exchange information with its neighbors, we need to know how each transputer is connected to its neighbor, and which links they are both using to communicate with each other.


Parent node-Child node

We will strive, as much as possible, to wire the transputers in a uniform way. This uniformity calls for some definitions. The first definitions we will introduce are those of parent nodes and children nodes in the chain. The parent/child relationship is similar to that in a tree, except that in our case the tree is unary. The Root transputer coincides with the root of the tree. Each transputer in the chain talks to its parent node over the same link, and to its child node over the same link. Since the root transputer communicated with the PC host through its Link0 (the PC/link) by default, this automatically sets the rules of our convention.

Rule 1

The PC host will be referred to as the parent of the root node. In turn, the root node is the parent of the next node down the chain.

Rule 2

We will assume that the parent node is always on the left of its child node (if any). The leftmost entity is therefore the PC host. The next to leftmost is the root node. A child node will therefore be on the right of its parent.

Rule 3

Each transputer will communicate with its parent via its Link0, and with its child (if any) through its Link1. The Root Transputer will be given the numeral 1, its child node will be labeled 2, and so on.

Figure 4-10 illustrates our convention.

4-2 Transputer to Transputer Communication

Let's start with a simple program that computes prime numbers. This algorithm is an algorithm of choice when dealing with parallelism, as it can be easily be tailored to many approaches. See [CARR90], for example, for interesting applications of this algorithm. We have chosen an implementation of the prime-finding algorithm that is not the most inefficient, but extremely simple to implement. In a first attempt to write a parallel program, this will be fine.

The algorithm works as follows. To find if integer x is prime, the program attempts to divide x by all integers lower than sqrt(x) Testing up to sqrt(x) is sufficient, but requires a square root or a multiplication operation. Our current goal is not to compute primes efficiently (yet), but to explore different ways to parallelize the program with Logical Systems' parallel library.

Behold the program:

/*
=======================================================================
   prime_v1.c

   DESCRIPTION: Computes primes on 2 nodes.

   Node 2 computes the primes and sends them to Node 1.
   Node 1 (Root) collects the numbers and prints them.

   TO COMPILE AND RUN:

make -f prime_v1 chainnif -# 2 -1 prime_v1 -nif prime_v1.nif ASSOCIATED NIF-FILE buffer_size 200; host_server CIO.EXE; level_timeout 400; decode_timeout 2000; 1, prime_v1, R0, 0, 2, ,; 2, prime_v1, R1, 1, , ,; ====================================================================== */ #include <stdio.h> #include <stdlib.h> #include <conc.h> /* transputer library */ /* ============================= GLOBALS ============================== */ #define INTSIZE((int) sizeof(int)) #define INTERVAL 100 #define SENTINEL -1 /* --------------------------------------------------------------------*/ /* MAIN */ /* --------------------------------------------------------------------*/ main() { int NoPrimes = 0, x, j; /* ===============================================================*/ /* NODE #1 */ /* ===============================================================*/ /*--- Receive information and prints it ---*/ if (_node_number==1) { /*--- while numbers are coming, get them and print them---*/ do { /*--- get input-channel address ---*/ ChanIn(LINK1IN, &x, INTSIZE); if (x!=SENTINEL) {

printf("%8d", x); NoPrimes++; } } while (x!=SENTINEL); printf("\nReceived %d primes \n", NoPrimes); exit(0); } else /* ===============================================================*/ /* NODE #2 */ /* ===============================================================*/ { /*--- scan interval ---*/ for (x = 1; x<INTERVAL; x++) { /*--- if prime then send number to Node 1 ---*/ if (IsPrime(x)) ChanOut(LINK0OUT, &x, INTSIZE); } /*--- signal Node 1 that we are done ---*/ x = SENTINEL; ChanOut(LINK0OUT, &x, INTSIZE); } } /* --------------------------------------------------------------------*/ /* ISPRIME */ /* Semi-efficient prime finding function which, given x, returns 1 if*/ /* it is prime, and 0 otherwise*/ /* --------------------------------------------------------------------*/ int IsPrime(int x) { int i; /* 0 1 2 3 4 5 6 7 8 9 */ static int SmallPrimes[10] = {0,0,1,1,0,1,0,1,0,0}; if (x<10) return SmallPrimes[x]; if (x%2==0) return 0; if (x%3==0) return 0; if (x%5==0) return 0; for (i = 2; i*i<=x; i++) if (x%i==0) return 0; return 1; }



Listing 4-1: prime_v1.c

We assume here that a function prime() is available, and that it returns 1 if x is prime, and 0 otherwise. The complete program along with its make file is included in the distribution disk, and can be run with the following simple commands:

	make -f prime_v1.mak

chainnif -# 2 -nif prime_v1.nif -1 prime_v1
ld-net prime_v1

The first command calls the make utility (see Chapter 3) to preprocess, compile, assemble and link the program. The second command creates a network file called prime_v1.nif which specifies that two transputers are to run the same program named prime_v1. The third command loads the program on the first two transputers in the chain: the root and its direct child.

The output of the program is shown below.

	      2       3       5       7       11      13      17      19
	      23      29      31      37      41      43      47      53
	      59      61      67      71      73      79      83      89
	      97
	Received 25 primes 

Let's take a closer look at the code. First, it contains only the standard main() function, which is divided into two parts, both preceded by a comment box. The first section contains the code that runs on the root transputer (Transputer 1), while the second one contains the code running on Transputer 2. The same program is loaded on both transputers, but a different section of the code is actually used by each. This mode of programming is called farm programming [COX91].

farm programming

Farm programming consists in putting in one program the many different tasks that must be executed by the different transputers. This program is then loaded on all the transputers, and each transputer selects the section of the program that it has been assigned to. The selection is done by testing the predefined variable _node_number, which contains a different value depending on which transputer tests it[1]. The best place to put the test is in main().

main()
{
     if (_node_number==1)
     {	
	/* code to be executed by Transputer #1 */
     }
     else
     {
          /* code to be executed by Transputer #2 */
     }
}

Note that we could have written the above test as two if-statements, one testing _node_number against 1, the other against 2, with exactly the same result.

As you can imagine, this method of programming may become quite inefficient when the network contains many transputers. In such a case the program is partitioned into many blocks, only one of which is executed by each transputer which must still allocate enough memory to store the whole program. This method has the advantage of being easy to implement, however, since only one code file needs to be created and compiled. This is the method we shall adopt in this chapter.

conc.h library

Let's go back to the program and analyze it line by line. The first unfamiliar statement is the #include <conc.h> statement. The conc.h library, first implemented by Jeffrey Mock of Pixar [MOCK90], contains functions and variables implementing an Occam-like concurrency model. It must be included every time we write a program for the transputer[2].

Let's look now at the way Node 1, the Root, receives information from Node 2.

	if (_node_number == 1)
		/*--- get numbers from Node 2 and print them---*/
		do
		{
			/*--- get input-channel address ---*/
			ChanIn(LINK1IN, (char *) &x,INTSIZE);
			 if (x != SENTINEL)
			 {
				printf("%8d",x);
				NoPrimes++;
			} 
		} while (x != SENTINEL);
		printf("\nReceived %d primes \n",NoPrimes);
		exit();

ChanIn()

The ChanIn function stands for Channel-Input. It receives information from a channel, here LINK1IN (a constant predefined in conc.h), and stores it at the address specified by the type-cast pointer &x. LINK1IN is the memory address of the I/O Port corresponding to the input side of the link. We saw in Chapter 2 that the I/O ports are memory-mapped. LINK1IN is thus a macro representing the memory address of the port. INTSIZE is a macro that is defined at the beginning of the program as representing the size of an integer. Hence the function of ChanIn is to receive some amount of information from a channel and to store that information at a location defined by a pointer. The size of the information received is defined by an integer representing the number of bytes contained in the message. The prototype for ChanIn is:

void ChanIn(Channel *, void *, int)

Channels, hard channels, soft channels

As we just discovered, tasks executing on different transputers communicate with each other through channels. A channel is a means of communication for parallel tasks. It is very tempting to associate a channel with a transputer link. In many cases the two will be the same. In such instances we refer to the channel as a hard channel. But from our discussion of the transputer support for multitasking in Chapter 3, you will remember that transputers are also designed to run internal "parallel" tasks as well, through a technique called multitasking. When tasks reside on the same transputer, they can use the same mechanism to communicate with each other. The concept of channel encompasses that of the Link, and extends beyond it. It is one level of abstraction away from the hardware link. We'll come back to this concept when we actually implement multitasking on a transputer.

Byte counter

Finally, the third argument needed by ChanIn is the number of bytes contained in the message. The sizeof function will often be used here. #define will come in handy for that purpose. The example below shows how we can transfer an integer, a string of 20 characters, and an array of 10 integers.

	int i, table[10];
	char string[20];

	ChanIn(LINK1IN, &i, (int) sizeof(int));
	
	ChanIn(LINK0IN, string, (int) sizeof(string));
	
	ChanIn(LINK1IN, table, (int) sizeof(table));
So, the main loop executed by Node 1 is fairly simple. It gets an integer from the channel LINK1IN, and stores it in x. If the value of x is not equal to the sentinel, it prints it and increments a counter. When a value equal to the sentinel is received, it exits the loops and stops.

Before we move on to Node 2, let's analyze the last statements executed by Node 1:

	
if (_node_number == 1)
{
/*--- get numbers from Node 2 ---*/
do
{
ChanIn(LINK1IN, &x,INTSIZE); if (x != SENTINEL)
{
printf("%8d",x);
NoPrimes++;
}
} while (x != SENTINEL);
printf("\nReceived %d primes \n",NoPrimes); exit(0); }
The call to exit() terminates the portion of the code executed by the root transputer. Is this truly necessary? After all, Node 1 cannot execute the second part of the code that is reserved for Node 2 because of the else statement, so the next statement executed by Node 1 is the end bracket of the main function, and execution automatically terminates at that point. Right? Wrong!

When the Root transputer reaches the closing bracket of the main function, its execution of the program effectively stops, and it returns control to some internal kernel. But the host computer, the PC, is still executing cio, and will continue executing it until the user stops it with a control-break sequence on the keyboard, or by resetting the computer. How did cio enter the picture, you may wonder. In the network information file! Can we tell cio that the transputer computation is done and that control of the host computer can be returned to DOS? The answer, you will have guessed, is yes, by having the root transputer call the exit() function. Here exit() not only terminates the program, it also terminates the cio server.

Exit Rule

Always terminate the code executed by the Root processor with an exit statement. Exit passes a signal to the cio driver running on the host computer instructing it to stop and to return to DOS.

. Let's continue now with the code executed by Node 2.


else { /*--- scan interval--*/ for (x = 1; x < INTERVAL; x++) if (IsPrime(x)) ChanOut(LINK0OUT, &x, INTSIZE);

/*--- signal Node 1 that we are done ---*/ x = SENTINEL; ChanOut(LINK0OUT, &x, INTSIZE); }


Node 2 executes a loop in which x takes all the values between 1 and INTERVAL, defined as 100. For each new value, x is tested by IsPrime(). If IsPrime() returns a non zero value, x is sent to Node 1 with a call to ChanOut.

ChanOut()

ChanOut stands for Channel Output. It is the companion of the ChanIn function executed by Node 1. Its syntax is exactly the same as that of ChanIn:

void ChanOut(Channel *, void *, int)

It needs a channel pointer indicating which channel the information is sent over. In our case, the predefined constant LINK0OUT is used. It corresponds to LINK0, and specifies the outgoing direction. This information is then followed by a pointer to the area of memory containing the information to send. Here again a void pointer identifies the memory address of the information to be sent, and the third argument specifies how many bytes constitute the message.

Channel pointers

Why are the channel pointers used by Node 1 and Node 2 different? Node 1 receives its information from a channel associated with a predefined constant LINK1IN, while Node 2 sends its information out through a channel identified by LINK0OUT. Is this correct? The answer is yes. Let's go back quickly to Figure 4-1. Our adopted convention is that a parent talks to its child through its Link 1, while a child responds through its Link 0. So LINK1IN is a constant representing the Link 1 input of a transputer, while LINK0OUT represent the output of Link 0. Because hardware links are bi-directional, two constants are needed per link. These constants are defined in conc.h and allow us to Chan(nel) information In and Out: LINK0IN, LINK0OUT, LINK1IN, LINK1OUT, LINK2IN, LINK2OUT, LINK3IN, and LINK3OUT. Following the convention established at the beginning of this chapter, we now have a new rule to adopt:

Channel Input/Output Rule

On a chain of transputers, communication between a node (parent) and its child node must take one of the following forms only:

	Parent			---->		Child
	ChanOut(LINK1OUT,...)			ChanIn(LINK0IN,...)

	Parent			<----		Child
	ChanIn(LINK1IN,...)			ChanOut(LINK0OUT,...)


There are two extremely important properties to remember when it comes to communicating over channels: the communication is blocking, and there is not buffering or packaging done with the information. We already covered these properties in Chapter 2, when we studied how transputers communicate with each other. This is a good example of where hardware properties directly control the way software primitives work.

Blocking

Blocking means that no transfer can take place unless both tasks exchanging data are ready to do so. In the context of our example, this means that if Node 2 is the first one to reach the ChanOut(LINK0OUT,...) statement, then it will not send any information until Node 1 executes the matching ChanIn(LINK1IN,...) statement. Moreover, the task running on Node 2 goes to sleep as soon as it executes the ChanOut statement, and does not awake until the transfer has completed. This was implemented by Inmos to allow other tasks to run on a transputer while one is awaiting a transfer through a channel. This type of communication is similar to the Ada rendez-vous, where the actual transfer takes place only when the two players involved are ready to exchange data.

No buffering or packaging of information

The absence of buffering means that no memory is used to hold the message being transferred (or received) during the communication. The message is sent as is, from first byte to last byte, without any packaging. The side-effect of this is that when a task starts sending a message to another task, the receiving task does not receive a header telling it how many bytes are coming. The receiving task must know in advance how many bytes to expect. Unpredictable results will occur if too many or too few bytes are sent by a transmitting task. In such cases, the computation will not be successful.

In our example, this means that both tasks must know that integers are exchanged on each transfer. This is easily accomplish by passing the macro INTSIZE to all ChanIn and ChanOut call. But because each prime is sent as it is computed, and because Node 1 does not know a priori how many primes will be found by Node 2, some mechanism must be used to insure that both Node 1 and Node 2 reach the end of the computation. The solution is for Node 2 to pass a sentinel to Node 1 when it completes its computation.

Matching ChanIn and ChanOut

When transferring information, the number of bytes sent by ChanOut must match exactly the number of bytes expected by ChanIn.

Now that we know how the ChanIn and ChanOut commands work, what really happens when Node 2 sends a prime to Node 1, but Node 1 is still in the process of printing the previous prime? This situation is illustrated in Figure 4.1. The top diagram represents the execution of the code running on Node 1 as a function of time. The bottom diagram shows the behavior of the task on Node 2. Let's concentrate on the task on Node 2. It runs (literally) until it executes the ChanOut command. It has found a prime and is sending it. At that moment, it enters an inactive state, shown in the diagram as a "resting" state. Nothing happens until the task on Node 1, a few moments later, executes a ChanIn on the corresponding channel. As soon as it does so, it too goes into inactive mode! This is very important. Both tasks go into inactive state. The designers of the transputer implemented this feature so that other tasks could share the transputer while the communicating tasks were involved with data transfer. The data transfer is not under software control, but under the authority of the link controller in the transputer, and the tasks simply rest until the hardware has transferred the data. This transfer time is marked by the interval between the two dashed lines. Finally, when the transfer is over, both tasks are awaken and resume execution. They are both "running" again. This is true here since each transputer executes only one task. Were either one of the transputers running multiple tasks, then it is possible that the sending or receiving task could have remained inactive for some time. But enough of this for now; we'll come back to this point later on when we deal with multitasking.


Figure 4.1: Timing of a ChanOut operation occurring before its matching ChanIn.

A slight relaxation of the Matching Rule when transfer takes place on hard channels is that it is not absolutely necessary to have one ChanIn statement for each ChanOut statement. It is possible to have a sending node execute two ChanOut statements transferring 50 bytes each to a receiving node which executes only one ChanIn statement for 100 bytes. As long as the total number of bytes sent matches the total number of bytes received, everything will work fine. The code segment below shows how Node 2 can send four characters separately to Node 1 which gets them as one integer (the transputer is a 32-bit processor, representing integers as 32 bits, or 4-bytes). Although this does not make much sense with characters and integers, this property will come in handy when we work with arrays, where arrays can be expected in full (only one ChanIn) but sent in small sections (several ChanOuts).

	int x;
	char a, b, c, d;

	if (_node_number == 1)
	{
		...
		 ChanIn(LINK1IN, &x, (int) sizeof(int));
		...
	}
	if (_node_number == 2)
	{
		 ...
		ChanOut(LINK0OUT,&a,1);
		ChanOut(LINK0OUT,&b,1);
		ChanOut(LINK0OUT,&c,1);
		ChanOut(LINK0OUT,&d,1);
		...
	}

[Previous] [HOME] [NEXT]