| 9-1 |
The N-body program we just explored is not as efficient as we could have made it. The inefficiency stems from the fact that the Computation/Shift actions do not overlap. In any given step, the transputers with odd Id numbers go through the following sequential steps:
S1 Send own slice to the right. Write a program based on the N-body program above that implements latency hiding, and measure its performance against that of the original program. Find the optimal size of the blocks shifted around the ring, in terms of execution time. . |
The goal is to place N queens on an N by N chessboard, so that no queens can take each other. Because queens can move horizontally, vertically, and diagonally, this means that there can be only one queen per row and one per column, and that no two queens can find themselves on the same diagonal. Figure 9-7 shows a solution for N=8. Finding a solution for a dimension N requires creating a search tree where each node represents a valid position of a queen on the chess board. Nodes at the first level correspond to one queen on the N by N board. Nodes at the second level represent boards containing two queens in valid locations, and so on. When a tree of depth N is found then we have a solution for positioning N queens on the board. Figure 9-8 shows the basic (and incomplete) structure of such a tree.
Partitioning in this case can be done by assigning subtrees to the individual processors, allowing the search to be done in parallel on different subtrees. The program stops when a processor reaches a terminal leaf (success), or when all the subtrees have been visited without ever reaching a terminal leaf (no solutions).
For problems that exhibit such irregular domains that grow and shrink during the computation, partitioning must often be carried out dynamically by techniques that ensure that each processor has some work to do, and that progress is made towards a solution. The Manager/Worker paradigm is such a technique, and we will study it shortly in the chapter on load balancing.
| 9-3 |
The N-Queen problem is interesting for many reasons. One of them
is that the parallelism that we have chosen to implement may not be the best
suited for this application, as Stone et al. have shown [STON86, STON87]. In [STON87], Stone and Stone show that finding a solution to the N-queens problem can be found extremely fast if one adopts the following rule to position the next queen on the board: "for the next placement, place a queen in the row that has the fewest placement choices." This means that the next queen is not placed automatically on the next row down, but on the row that has the fewest possible positions. In case of ties, one of the rows is chosen randomly. The power of this scheme is that by selecting the next position as the one offering fewer choices, the number of backtracks--the program "undoing" part of its previous layout to try new queen settings--is tremendously reduced. As a result a solution is found much faster. For example, Stone and Stone report that for the 29-Queen problem, the program of Exercise 9-2 requires 1,532,210 backtrack operations, while the "most-constraint next" placement strategy requires only 313 backtracks. Implement a parallel version of Stone and Stone's algorithm on a network of transputers. |
(9.3)
Each term Cij is the dot product of the ith row of A and the jth column of B:
Hence every element of C is the combination of a row of A and a column of B. Mapping the data to the processors should take into consideration not only the properties of the domain (square matrices), but also the data-dependencies created by the problem.
Let's try to figure out a way to break down the C matrix among processors, so as to obtain an efficient decomposition, both in terms of its performance and in terms of the programming effort that will be needed on our part. In cases where the size of the matrix is larger than the number of processors, we will have to break C down into subsets to be distributed to each processor. It might be tempting to assign a group of rows (or columns) of C to each processor. For example, assume that one processor is in charge of computing the first row of C: C11, C12, ..., C1N. To perform this computation it will need the first row of A, and the whole matrix B. Therefore, a mapping requiring each processor to compute individual rows of C, would require the matrix B to be stored in its entirety at each node of the processor network. If, on the other hand, we require a processor to compute the first column of C, then each terms in that column will depend on a different row of A, and on the first column of B. So we just reversed the problem.
Although these solutions are acceptable, they are not the most memory efficient. A solution reminiscent of the N-body problem is to divide the domain into square blocks, or slices, that can move from processor to processor as the computation goes on, and to organize the processors in a square mesh.
Assuming that we have P processors organized as a p by p array (P=p2), each matrix is laid out on the square array of processors and each processor is assigned a square sub-block of the three matrices.
Each processor is assigned three sub-blocks: the sub-block of the C matrix that it must compute, one sub-block of the A matrix sharing the same row indexes as the elements of the C sub-block, and one sub-block of the B matrix sharing the same column indexes as the element of the C sub-block. Because the processors only contain parts of the A rows and B columns needed to compute the terms of their C sub-block, they must exchange A sub-blocks with processors in the same horizontal row of the processor array, and they must also exchange B sub-blocks with processors in the same vertical column, as shown in Figure 9-9.

All right, things are going to get slightly complicated, so let's follow a simple example of the case where A, B, and C are 4 by 4 matrices, and the sub-blocks are of size 1. That is, we have 16 transputers organized as a 4 by 4 toroidal mesh. Assigning indexes to the transputers, as shown in Figure 9-10, we have each Transputer Tij in charge of computing Cij. Let's investigate first a mapping where we assign Aij and Bij to Tij as well. This will allow us to explore the pattern of compute/communicate actions. We concentrate on Transputer T22.