This textbook is about parallel programming. Unlike other books in this area that present a breadth-wise approach to introduce the concepts and issues related to parallel programming, including examples of partial programs written in different languages for different machines, I have taken the approach of presenting an in depth coverage of a particular type of machine, using one language. This book teaches how to create parallel programs written in C for transputer-based multiprocessor machines.
My approach is a to provide a complement to the material covered by those textbooks, rather than an alternative. Learning about performance issues such as clearly defining the conditions under which the speedup of an application should be measured takes on an added dimension when the reader is confronted with the exercise of writing, measuring and tuning a parallel program. When such issues are learnt in the same hardware context, related problems in different architectures or different parallel paradigms become more perceptible, easier to grasp and understand. At the same time an expertise is built up, and an acute perception of material dealing with parallel processing is acquired.
This book is intended for an undergraduate audience who is already familiar with the C programming language, and who has a good grasp of data structures and some knowledge of algorithmic complexity. The material provided in the book can support a semester course devoted to parallel programming in a message-passing environment, or can be used as the support for a section of a course on parallel processing where several parallel organizations are investigated. A one semester course presenting three different approaches to parallel processing, such as data-parallelism with parallelaxis, heterogeneous processing under PVM (Parallel Virtual Machines), and message-passing with the transputers for example, present an interesting diversified approach to parallelism.
Engineers and scientists who are interested in investigating parallel computing at a reasonable cost, or who want to boost the performance of their personal computers or workstation by adding a parallel attached processor will also find this book interesting, with the progressive introduction of the parallel concepts that have been added to ANSI C, and the full listings of programs along with their availability on the diskette distributed with the book.
1-1 Transputer-based multiprocessors and parallel architectures
The evolution of computer architectures has been very active since the construction of the IAS Computer based on the design of John von Neumann in 1945. The taxonomy of computer architectures is now a tree with many branches and leaves, parallel computers representing the "greenest" part of the tree. Although an attempt at cataloguing every parallel design precisely is beyond the scope of this book (refer to Hwang's comprehensive study instead [HWAN93]), it is meaningful to see where transputer-based computers sit in the tree and identify the other architectures with which they are related.
The classification of computer architectures established by Flynn in 1972 [FLYN72] does not fully embody the diversity and cross-influences that are present in modern computers, but it is still popular because it imposes a reassuring order and maintains a level of simplicity much needed in the rich fauna of today's emerging architectures. sFlynn's categorization is based on the way an architecture manages the flow of instructions that operates over the data. The single processor machine whose first attributes were so accurately defined by von Neumann in 1945 [VONN45] is termed a Single Instruction stream, Single Data stream machine, or SISD. In this model, at any given time, only one datum can be operated on, and the operation is defined by one instruction only.
Early on, parallelism was introduced in this model in order to gain performance. The first places where it appeared was in look-ahead mechanisms whose goal was to prefetch instructions so as to overlap instruction fetch and instruction execution. Soon this method generalized to pipelining (exploiting parallelism in time), and to vector operations (exploiting spatial parallelism) where one instruction can now operate on several data.
The next general step in that direction was to have computers with one control unit, receiving one instruction which directed many processing units, each performing the actions required by the control unit over different data. This one-control-unit, many-processing-units combination is an example of SIMD architectures in Flynn's taxonomy: Single Instruction stream over Multiple Data streams. Thinking Machine's CM-2 supercomputer of 65,536 processing units typifies the SIMD approach. On the vector side of the SIMD classification, we have the Cray Y-MP and C-90, Digital's VAX 9000, the family of Convex C3800 machines as examples of vector supercomputers. They are supercomputers in that they not only use vector unit, they also pipeline them, exploiting parallelism both in time and space.
It is natural that Flynn's classification contain two more branches, corresponding to the combinations of "Single" and "Multiple" attributes for the Instruction and Data streams. MIMD, Multiple Instruction streams on Multiple Data streams corresponds to the environment we explore in this book. Different instructions are executed simultaneously in different processors acting on different data, with the added requirement that they all contribute to solving a common problem. We are now looking at the category of multiprocessors and multicomputers. Although these two terms are used interchangeably more and more often, the accepted convention is that a multiprocessor refers to a machine where multiple processors access a common, shared, memory, while in multicomputers the memory is distributed, each processor having access only to its own memory. Transputer-based machines are thus MIMD machines, and more precisely multicomputer machines. They belong to message-passing architectures, which is one of the two major sub-categories of the MIMD classification. Message-passing is the technique processors must use to exchange information with each other, and because no remote memory accesses are possible, they are also called NORMA machines (NO Remote Memory Access). Transputer-based systems such as the Parsys Ltd. SuperNode1000, or Computer Systems Architect SuperSet-64 are NORMA multicomputers. Other multicomputers include the Intel Paragon XP/S of i860 processors, and the nCube/2 computers.
The other major sub-category of MIMD machines contains shared-memory architectures, where the processors share the same memory space, often through the use an interconnection network. Depending on whether the processors have uniform or non uniform access to the memory, or whether the memory is represented only in caches associated with the processors, we have the subcategories of UMA (Uniform Memory Access), NUMA (Non Uniform Memory Access) or COMA (Cache Only Memory Access). The Sequent Symmetry computer is an example of an UMA architecture. The BBN Butterfly, on the other hand is representative of a NUMA system, while the Kendal Square Research KSR-1 is a recent COMA architecture.
In many of the parallel programs I create in this book, I introduce a different flavor to MIMD computation: that of a whole program distributed to all the processors of a multicomputers, but such that different processors execute different sections of the program on different parts of the data stream. This same program, multiple data streams paradigm is referred to as SPMD, and is not part of Flynn's original decomposition.
This quick tour of the family of current parallel computer architecture shows that this is an extended and fast-growing family, with many organization and programming paradigms. This is another reason why I propose a two level-approach to studying parallelism. One is broad in nature, surveying the major architectures and the problems characterizing their programming. The other, integrated in this book, concentrates on one architecture and provides a concrete support for definitions and concepts that could otherwise remain general and less well understood.
Transputer multicomputers are appealing for two additional, unrelated reasons. The first one is flexibility, both in the implementation of user-configurable networks, and in a growing software base, which includes, among other languages, C, C++, Occam, Pascal, Modula-2 and Prolog. The second reason is affordability. One- and four-transputer boards are now available for most popular personal computers and workstations, making transputer-based multiprocessors available for as low as $450[1] per node, bringing the exploration of parallel programming within the reach of most universities and colleges, and even individuals.
1-2 Flexibility and versatility
A transputer circuit contains a processor, a small amount of memory, a co-processor, and four high-speed bi-directional links. Although such simple a list may not be impressive, the elegant design of the transputer provides a rich environment for parallel processing.
In a teaching environment, these three features make the transputer appealing for learning parallel programming, and for experiencing the challenges associated with it. Implementing Data sharing in a parallel environment, which is often introduced in an operating systems course can be studied first hand with multitasking on a single processors. Heterogeneous processing and scheduling issues can be discussed with examples where the processor of the host machine cooperate with an attached transputer in solving a problem. Running several logical node operations of a single processor is also an important step in debugging and mapping large applications to small processor networks.
These examples illustrate the wealth of issues that can be addressed through the versatility and flexibility of a transputer system. In this book I have tried to approach them methodically, along with the paced introduction of programming concepts, in order to form a general frame where the reader builds up strength and expertise along the way, tackling more challenging problems as the progression continues
The book is based on Logical Systems C, Version 93.1. Logical Systems has enhanced ANSI C with parallel extensions in the form of a separate library (conc) containing new macros and functions. This results in a clean extension of C where all the operators and constructs are ANSI compatible, and where the parallelism is introduced through the use of functions. This feature has the added and surprising benefit that any ANSI C compiler running on the host machine (PC compatible, Macintosh or workstation) can be used to check the programs for syntax errors. As some of these C compilers have sophisticated integrated environments, they offer an unexpected practical and efficient medium for developing parallel C programs .
This book assumes that the host machine is an IBM PC or compatible, as all the exercises and examples were developed for this platform. However, Macintosh users will find that besides the actual installation of the software and the command syntax for compiling and running programs, there is actually very little "PC" specific details in the book, and most programs and examples can be directly ported to that target machine.
1-4 Why use C?
Why use C when the transputer supports a language developed especially for it: Occam? Good arguments can be made for both languages. My main reason for using C is that it is a language most computer science students and scientists are familiar with, and a much faster learning curve can be expected in this case. The learning effort is spent dealing with the parallel constructs, rather than with the complexity of a new language adorned with parallel constructs.. This is an important issue, since creating and running a parallel program on the transputer is not a trivial process, requiring the exact mapping of the parallel application to the physical network. Familiarity with the language becomes especially important when message passing is one of several paradigms explored during the course of a semester. In such cases, the added efficiency of an ANSI C integrated environment, as discussed above, becomes increasingly attractive.
Another major benefit of programming in C is that a large collection of sources of parallel algorithms contains descriptions written in C or in C-like languages, making their implementation on transputers a relatively easy process.
1-5 Chapter Organization
Our first stop is to look at the architecture of the transputer. This is a worthwhile first stop, not only because of the interesting architectural design decisions that have been made to create a processor supporting parallelism, but because a good understanding of the hardware makes for its efficient utilization, and for good programming. This chapter is first not so that it can be skipped easily and programming can be introduced right away. Instead, we explore the hardware first because it provides the framework for the parallel constructs that are developed in the chapters that follow. For example, predicting the behavior of concurrent tasks in the transputer is much more intuitive once one understands how the processor manages tasks of different priorities, and how timers are used to schedule tasks. In addition, the knowledge of the architecture and a keen understanding of the interactions between the processor, the links, the timers and the memory become paramount when performance is the factor driving the programming effort.
Chapter 3 develops the mechanics for creating and running parallel programs written in Logical Systems C. With this background and the understanding of the material covered in Chapter 2, Chapters 3, 4 and 5 explore the parallel constructs involved in creating concurrent tasks and in implementing data communication.
After completing Chapter 5, we are ready to write sophisticated parallel programs running on several transputers. All the major parallel constructs have been presented and studied in the context of different examples, and it is a good time to discuss debugging issues, and to present and recommend good programming techniques that will form a framework where debugging can be done efficiently and successfully.
With Chapter 7 we fully enter the domain of parallel processing and start with a condition that is inherent to parallel computation: deadlocks. We study different conditions under which deadlocks can occur and look at some preventive solutions. Again, I stress here good programming practices and strengthen a general framework with programming rules that should be enforced when writing parallel code.
Chapter 8 introduces the major performance metrics used to rate parallel programs, and provides examples of how they can be measured in practice. Execution time, speedup, processor utilization and efficiency, throughput are covered in details. This chapter provides the tools needed to undertake the fine tuning of parallel applications, when performance must be maximized.
In Chapter 9 and 10 we introduce techniques for decomposing parallel application and creating parallel programs. Most of these techniques are the source of many on-going research efforts, and we look at the problem of scheduling concurrent tasks, partitioning regular and irregular data domains among processors, associating data with processors. Chapter 10 is reserved for the problem of load balancing.