Parallel Programming in C for the Transputer
© D. Thiébaut, 1995




EXERCISES


 
9-1

The N-body program we just explored is not as efficient as we could have made it. The inefficiency stems from the fact that the Computation/Shift actions do not overlap. In any given step, the transputers with odd Id numbers go through the following sequential steps:

S1 Send own slice to the right.
S2 Receive temporary slice from the left
S3 Compute interaction between body in temporary slice on bodies in own slice.
S4 Send temporary slice to the right.
S5 Receive new temporary slice from the left.
S6 As long as the whole domain hasn't been received, go back to step 3.

Transputers with even Id numbers reverse the Send and Receive order. A faster solution would be to have Steps 2, 3, 4, and 5 overlap in some fashion. This type of programming is know as hiding latency. The idea is that transputers should not have to wait for messages to be exchanged. Instead, they should initiate a transfer and then go back to do some work.

Write a program based on the N-body program above that implements latency hiding, and measure its performance against that of the original program.

Find the optimal size of the blocks shifted around the ring, in terms of execution time.

.


 


Irregular Domain Partitioning: The N-Queens Problem

At the other end of this spectrum of partitioning based on the data domain, we have applications that create dynamic data structures of irregular shape. Games based on strategic searches, for example, create dynamic search trees that are not regular. Let's take a look at the N-queens problem, for example[1].


Figure 9- 7: Example of a board with 8 queens.

The goal is to place N queens on an N by N chessboard, so that no queens can take each other. Because queens can move horizontally, vertically, and diagonally, this means that there can be only one queen per row and one per column, and that no two queens can find themselves on the same diagonal. Figure 9-7 shows a solution for N=8. Finding a solution for a dimension N requires creating a search tree where each node represents a valid position of a queen on the chess board. Nodes at the first level correspond to one queen on the N by N board. Nodes at the second level represent boards containing two queens in valid locations, and so on. When a tree of depth N is found then we have a solution for positioning N queens on the board. Figure 9-8 shows the basic (and incomplete) structure of such a tree.

Partitioning in this case can be done by assigning subtrees to the individual processors, allowing the search to be done in parallel on different subtrees. The program stops when a processor reaches a terminal leaf (success), or when all the subtrees have been visited without ever reaching a terminal leaf (no solutions).

For problems that exhibit such irregular domains that grow and shrink during the computation, partitioning must often be carried out dynamically by techniques that ensure that each processor has some work to do, and that progress is made towards a solution. The Manager/Worker paradigm is such a technique, and we will study it shortly in the chapter on load balancing.


Figure 9- 8: Partial search tree for the 8-queen problem. As the search progresses down the tree, more queens are added to the board. In this example we assume that queens are added on successive rows of the board, so as the search progresses down, we cover more rows of the board. When the search reaches a leaf at the Nth level, a solution has been found.


EXERCISES


 
9-2

Implement the N-queens problem on a network of transputers. Start by assigning one queen to each of the positions of the top row of an N by N board, and distribute these "boards" to each of the transputers. Then each transputer will find a position on the second top row where a queen can be placed without attacking the first queen. Then a third queen will be positioned on the third row, such that the three queens cannot take each other, and so on. When the next row down does not contain a possible position for a queen, the program must backtrack and remove the last queen put on the board and find a new unvisited position on the same row it was on.

The search continues until a transputer can put N queens on the N by N chessboard (this type of search where the search stops after the first solution is found is referred to as a terse search [NICO86]). The solution is then listed by the column position of the queens on the rows, starting with the top row down to the last row. For example, the solution shown in Figure 9-7 can be referred to as (4,1,5,8,2,7,3,6). The parallel program stops once the solution is printed.

Write your program so that the number of queens can be user-defined. Make your program keep track of the time at which the transputers run out of work and become idle. Measure the time required to find a solution for N ranging from 8 to 29. .


 

 
9-3
The N-Queen problem is interesting for many reasons. One of them is that the parallelism that we have chosen to implement may not be the best suited for this application, as Stone et al. have shown [STON86, STON87].

In [STON87], Stone and Stone show that finding a solution to the N-queens problem can be found extremely fast if one adopts the following rule to position the next queen on the board: "for the next placement, place a queen in the row that has the fewest placement choices." This means that the next queen is not placed automatically on the next row down, but on the row that has the fewest possible positions. In case of ties, one of the rows is chosen randomly. The power of this scheme is that by selecting the next position as the one offering fewer choices, the number of backtracks--the program "undoing" part of its previous layout to try new queen settings--is tremendously reduced. As a result a solution is found much faster. For example, Stone and Stone report that for the 29-Queen problem, the program of Exercise 9-2 requires 1,532,210 backtrack operations, while the "most-constraint next" placement strategy requires only 313 backtracks.

Implement a parallel version of Stone and Stone's algorithm on a network of transputers.


 


Mapping


Maintain locality

Mapping is complementary to partitioning. While partitioning is involved with the way we cut the cake, mapping deals with distributing the pieces to the guests. The important goal of the mapping action is to maintain locality. In a message passing environment, it is paramount to keep communication to a minimum, and to restrict it, as much as possible, to near-neighbors. Therefore mapping requires detailed knowledge of the processor-network and its message routing capabilities, and of the relationship between the elements of the data structure to be distributed over the processor network. Two different techniques are essentially used in mapping: indexing and hashing. Indexing is a simple assignment function that matches the mathematical description of the data structure to the description of the processor network. Hashing is similar to the concept used in hash-tables data-structures where some hashing function is used on the data to find the index of where the item should be stored in the table. In multiprocessors, hashing is used when the data domain does not have regular or symmetrical properties allowing it to be evenly distributed among the processors. Instead, a hashing function is used on each datum to generate the Id of the processor where that it will reside. If each processor is given the hashing function, then the knowledge of the location of each item of the domain in the processor network can be fully distributed. Hashing is straightforward in its implementation, but indexing can have some interesting subtleties, which we shall now explore.

Indexing

Let's take a look at the problem of matrix multiplication. Two matrices, A and B, are multiplied to yield a third matrix C. Let's assume that A and B are square and of size N by N (this assumption simplifies the problem, but our solution can be extended directly to non-square matrices).

     (9.3)

Each term Cij is the dot product of the ith row of A and the jth column of B:

     (9. 4)

Hence every element of C is the combination of a row of A and a column of B. Mapping the data to the processors should take into consideration not only the properties of the domain (square matrices), but also the data-dependencies created by the problem.

Let's try to figure out a way to break down the C matrix among processors, so as to obtain an efficient decomposition, both in terms of its performance and in terms of the programming effort that will be needed on our part. In cases where the size of the matrix is larger than the number of processors, we will have to break C down into subsets to be distributed to each processor. It might be tempting to assign a group of rows (or columns) of C to each processor. For example, assume that one processor is in charge of computing the first row of C: C11, C12, ..., C1N. To perform this computation it will need the first row of A, and the whole matrix B. Therefore, a mapping requiring each processor to compute individual rows of C, would require the matrix B to be stored in its entirety at each node of the processor network. If, on the other hand, we require a processor to compute the first column of C, then each terms in that column will depend on a different row of A, and on the first column of B. So we just reversed the problem.

Although these solutions are acceptable, they are not the most memory efficient. A solution reminiscent of the N-body problem is to divide the domain into square blocks, or slices, that can move from processor to processor as the computation goes on, and to organize the processors in a square mesh.

Assuming that we have P processors organized as a p by p array (P=p2), each matrix is laid out on the square array of processors and each processor is assigned a square sub-block of the three matrices.

Each processor is assigned three sub-blocks: the sub-block of the C matrix that it must compute, one sub-block of the A matrix sharing the same row indexes as the elements of the C sub-block, and one sub-block of the B matrix sharing the same column indexes as the element of the C sub-block. Because the processors only contain parts of the A rows and B columns needed to compute the terms of their C sub-block, they must exchange A sub-blocks with processors in the same horizontal row of the processor array, and they must also exchange B sub-blocks with processors in the same vertical column, as shown in Figure 9-9.


Figure 9- 9: (a) A sub-block of the C matrix is computed by combining the equivalent rows of A with the equivalent columns of B. (b) Assuming that each processor is assigned a sub-block of C, as well as sub-blocks of the equivalent rows of A and columns of B, the processor can combine a sub-block of A with a sub-block of B, then shift the A sub-blocks to the right and the B sub-blocks down. (c), (d), and (e): After shifting sub-blocks three times, the processor has received the whole rows of A and columns of B that corresponded to its C sub-block.

Toroidal mesh for 2-D version of N-body problem

We are faced here with a two-dimensional version of the N-body problem. Hence our square mesh of processors must be capable of shifting information from a processor on the bottom row to the one aligned vertically on the top row, and conversely for edge-processors aligned horizontally. A square mesh with this property is called a toroidal mesh. The processors on the vertical edges of the mesh are connected together, and so are the processors on the horizontal edges, as shown in Figure 9-10.


Figure 9- 10: A toroidal mesh of 16 transputers. Because the mesh uses up all four links of the transputer, an extra transputer is needed at the end of one ring to provide an interface to the Host. It is possible to implement a 4 by 4 toroidal mesh with only 16 transputers if one uses virtual channels.

All right, things are going to get slightly complicated, so let's follow a simple example of the case where A, B, and C are 4 by 4 matrices, and the sub-blocks are of size 1. That is, we have 16 transputers organized as a 4 by 4 toroidal mesh. Assigning indexes to the transputers, as shown in Figure 9-10, we have each Transputer Tij in charge of computing Cij. Let's investigate first a mapping where we assign Aij and Bij to Tij as well. This will allow us to explore the pattern of compute/communicate actions. We concentrate on Transputer T22.




[Previous] [HOME] [NEXT]