Parallel Programming In C. Chap. II
Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



2-2 The Processor

The design of the processor part of the transputer is very interesting. Although you may be tempted to skip directly to the programming section, hold on for awhile. Some of the questions that will arise later when we start Faswriting parallel tasks will be readily answered if we know a few things about the processor hardware. Also, it is important to know where the performance of the processor comes from. We will see that several key design choices contribute to the transputer's overall performance:

No dedicated data registers. The transputer does not have dedicated registers, but a stack of registers, which allows for an implicit selection of the registers. The net result is a smaller instruction format.

Reduced Instruction Set design. The transputer adopts the RISC philosophy and supports a small set of instructions executed in a few cycles each.

Multitasking supported in microcode. The actions necessary for the transputer to swap from one task to another are executed at the hardware level, freeing the system programmer of this task, and resulting in fast swap operations.

We will investigate these three features based on the Inmos T800. The T400 is in every respect similar, except that it does not contain a floating point unit, and has only two I/O ports. The block diagram of transputer is shown in Figure 2-2.

Figure2-2: T800 Transputer block diagram.

Stack-based operation

The processor contains six registers. Three (the A, B, and C registers) are used as data registers and implement a stack. If you ever used a Hewlett-Packard calculator, you are already familiar with the reverse polish notation, and the simplicity and efficiency associated with its use. Here is how the processor evaluates the following high-level expression:

x = a+b+c;

where x, a, b, and c represent integer variables. With an evaluation stack, there is no need to specify which processor registers receive the variables. Here, the processor is simply told to load each variable, one after the other and to add them. As variables are loaded they are pushed into the stack. Every time an operation is performed, the two values at the top of the stack are first popped, then combined, and the result of the operation is pushed back onto the stack:


			;stack contents (=Undefined)
			;[	 	 ]
	load a		;[a	 	 ]
	load b		;[b	a	 ]
	load c		;[c	b	a]
	add		;[c+b	a	 ]
	add		;[c+b+a	 	 ]
	store x		;[c+b+a	 	 ]

The above code shows a fragment of a program assigning the sum of the three variables to x, written in a pseudo assembly language. We could have added the first two variables before loading the third one to get the same result. The advantage of operating with a data stack lies is that it removes the need to add extra bits to the instruction to specify which register is accessed. As a result, instructions can be packed in smaller words, the net result of which is a tighter fit in memory, and less time spent fetching the instructions from memory.

Multitasking

The process by which a processor divides its time among several programs (seemingly) running at the same time is called multitasking. Multitasking is an important way to improve the performance of a processor by allowing it to start running another program if the one that it is currently executing cannot continue for awhile. A typical reason for this to happen is that an Input/Output operation is necessary but the peripheral, or I/O controller is not available. By allowing the processor to put the first program aside and switch to another one the total execution time of both is reduced.

Multitasking is the first form of parallelism supported by the transputer, and is accomplished by having the processor maintain a list of tasks that must be executed. Each task executes for a small amount of time, called a quantum, before it is stopped, and swapped for another task (if one awaits). For the transputer, a quantum of time is typically on the order of two milliseconds (2 10-3 seconds)[1]. Task are executed in a round-robin fashion. When a task eventually terminates it is removed from the list. At any time, new tasks may be created and added to this list.

In Section 4-4 we will see how Logical Systems C supports this (simplified) paradigm, which will allow us to write programs supporting multiple parallel tasks.

At any time a transputer task may be in either one of two states:

Active
This state refers to a task that is being executed, or in the list of tasks waiting to be executed.
Inactive
This state refers to a task that is not in the list of active tasks, because either one of three conditions is preventing it from continuing execution:
  • The task is waiting for an input from one of the I/O ports
  • the task is waiting to output to one of the I/O ports, or
  • the task has been asked to stay idle until a specified time in the future.

Active tasks

The transputer maintains the active tasks chained in a linked list, and two of its internal registers are used to point to the front and rear of the list. The actual list is stored in memory, and the registers contain the memory address of the cells defining the tasks. To further increase the flexibility and power of the multitasking environment, the transputer implements two levels of priority for the tasks:

High priority tasks (level 0):
These tasks indeed have a high priority: Once they gain control of the processor, they continue executing until they complete, or until they need to transfer information over a serial link. In this case, the task is said to be blocked waiting for a link. These tasks thus enjoy an unlimited quantum, and, as a result, only tasks that are expected to run for short periods of time are given high priority.

Low priority tasks (level 1):
These tasks run whenever there are no high priority tasks active, and run for up to one quantum of time, switching in a round robin fashion. Most user-defined tasks will be low-priority.

Hence the transputer needs to maintain not one, but two linked-lists, one for low-priority and one for high-priority tasks, and uses a total of four registers to point to the front and rear of both lists.

Fast task-switching

The switching between tasks belonging to the same priority-list or to different priority-lists is handled directly by the processor, and all register updates are controlled internally, through the microcode. Removing this action from a software kernel renders this operation extremely fast: less than 1 us typically [INMO88b].

Inactive tasks

Sometimes, active tasks encounter a situation where they cannot continue running until some external event occurs. In such cases the processor deschedules the , which becomes inactive. The external event is typically the expiration of a quantum, a timer event, or an event related to information transfer over a link.

When a task changes from active to inactive, it is removed from its associated linked-list and placed in what the Inmos documentation refers to as the workspace, which is a reserved area of memory. Because inactive tasks are removed from the link-lists, the processor does not suffer any overhead from them when it scans the lists to find the next task to run.

Multitasking Rule

The instantaneous performance of a transputer is not affected by the number of inactive tasks that may exist at any given time
We will refer to the inactive tasks that are waiting for communication as blocked tasks. We will explore this concept when we look at the communication process in Section 2-4. The inactive tasks that await a specified time, though, merit some attention right now.

Timers and inactive tasks

The T800 contains two 32-bit timers. The timers are separate entities outside the processors, as shown in Figure 2-2. Each timer is associated with a priority. One timer, available to high-priority tasks, is incremented every microseconds, and cycles through all its 32-bit states in 4295 seconds (232 s), or 71 minutes and 35 seconds. The other timer is associated with low-priority tasks and is incremented every 64 microsecond, resulting in a 76-hour long cycle.

Each timer can be viewed as two registers. One, called Timer, incremented at every tick of a clock, the other, TNextReg, defining a time in the future when some event must occur.

The following scenario explains how these counters come into play. Assume that a task is designed to check the status of an I/O controller at regular intervals, say 20 times a second. Upon checking the controller and having performed its function, the task can deschedule itself, that is voluntarily change its status from active to inactive, while simultaneously specifying a time in the future when it should be rescheduled. It does so by reading the contents of Timer and by adding a time delay to it. This new value represents the time in the future when the tasks wants to be awakened, and is stored in the TNextReg register by the processor. When the contents of Timer reaches the count in TNextReg, an event occurs (interrupt) and the processor puts the task back in the list of active tasks.

More precisely, when a task performs such a descheduling action, it is removed from the list of active tasks and added to a timer queue, located in memory. If that task requested a "wake-up" time earlier than any of the times associated with the tasks in that queue, then that time is stored in the TNextReg register. When the task becomes active again, it is placed at the end of the list of active tasks, and because this list may not be empty, the task may not start right away.

The capability for tasks to determine the exact time at which they are activated is a useful feature in context where synchronization of tasks is important, for debugging purposes, or when reliability is an issue, and watch-dog timers and time-out detection must be implemented.


EXERCISES

2-1

Since the high priority timer wraps around every 71 minutes, we may run into problems when we try to time very precisely (from high priority tasks) computations that run longer than that period of time. Sketch an algorithm that can be used to detect and correct overflows of the timers, so that the execution time of extremely long running programs can be measured.

2-2

Assume that we have written a program with three parallel tasks: T1, T2, and T3, all running at low priority. T1 runs by scheduling itself via the timer, so that it is awakened every 100 ms (milliseconds), to run a computation that does not last longer than 5 ms. What is the average interval of time between actual instants of time when T1 starts its computation, assuming that T2 and T3 always enjoy full quantums? What is the minimum interval between successive starting time of T1? Maximum interval?

[Previous] [HOME] [NEXT]