Para llel Programming in C for the Transputer
© D. Thiébaut, 1995



Step 1. A22 and B22 are already in the memory of T22, so the computation can start right away. In C, this yields: C22 = A22*B22;
Step 2. The transputers in the second column rotate their copies of the B elements. Then, the transputers in the second row rotate their copies of the A elements1. T22 now contains A21 and B12, which it can multiply and add to its running sum: C22 += A21*B12;
Step 3. After one more rotation of the B copies among the transputers in the second column, and a rotation of the A copies in the second row, T22 contains A24 and B42, which it can multiply and add to its running sum c22 += A24*B42;
Step 4. Final rotations column-wise and row-wise. T22 receives A23 and B32, which it can multiply together to form the final product to be added to C22. C22 += A23*B32;

At the end of Step 4, T22 contains C22, which, indeed, contains the sum described in Eq. 9-?.

Still, some more complexity

Although it looks like we have found an efficient way of computing the individual terms of the C matrix, there is still a slight detail pertaining to indexing that requires our attention. The algorithm that we just sketched seems to imply that all transputers will be able to compute their Cij term by going through four steps of compute and rotate operations. This is true only for the transputers along the diagonal of the mesh! To verify this, simply look at how the picture would look like for T23.
T23 starts with A23 and B23 in memory, but they can't be combined since the second index of A does not match the first index of B. An horizontal rotation must take place before the first product can be computed. If we had looked at T24, we would have found that two horizontal rotations would been necessary before the computation could start.

The solution is to stagger the rows and columns of the A and B matrices in the mesh, so that each processor can start the computation before any rotation is performed, and to make sure computation can be done at each node between each pair of row and column rotations.

(a)


(b)



(c)



(d)



(e)


Figure 9-1: Step by step decomposition of assignment of matrices to processor mesh. The main requirements driving the assignment is that Tij must compute Cij, and that the top left cell must contain C11, A11 and B11.

The first step is to perform the assignment for the first row. The assignment of the Cij elements is trivial: we match the i and j indexes of the transputer. Then we assign the elements of A: A11, A12, A13, and A14. Here we simply follow the indexing used for Cij elements (Figure 9-1a). Once the elements of C and A are on the first row, the assignment of B elements must be done so that each processor contains a triplet of the form Cij, Aik, Bkj (Figure 9-1b). The same technique is used for the first column, this time assigning C and B elements first (Figures 9-1c and 9-1d). Finally, the remaining elements of the matrices are assigned so that each row contains a rotation of a row of the original A matrix, and each column a rotation of a column of the original B matrix.

Indexing: Matrix Multiplication

We can now formulate our indexing mathematically. We observe that the elements of the A matrix mapped to the second row of the transputer mesh are rotated left by one position relative to their natural place in the A matrix. On the third row, they are rotated left by two positions. On the fourth row, they are rotated left by three positions. Hence Tij will be assigned Aik, where k=(i+j-2) mod 4. By the same token, we observe that the second column of B terms is rotated up by one position. The third column, by two positions, and the last column by three positions. Tij is thus assigned Bkj, where k=(i+j-2) mod 4, which is indeed what we need since we need to pair up Aik and Bkj terms.

We have now defined the indexing scheme required to assign elements of the C, A, and B matrices to the array of transputers for the matrix multiplication.

      (9-5)

where N is the size of the matrix and the size of the toroidal mesh.

Generalization of the indexing to larger matrices

The generalization of the scheme we just presented to matrices larger than the size of the toroidal mesh is immediate. Instead of containing just one element of the matrices, a transputer now contains a sub-block of the matrices, this sub-block being a small matrix in its own rights.

For example, assume that the matrices are now of size 8 by 8, but that the transputer network remains a 4 by 4 toroidal mesh. Each transputer now contains a 2 by 2 sub-block of each matrix. We shall denote a sub-blocks with a b superscript, as in Cijb. Transputer T22, at the beginning of the multiplication, will contain C22b, A23b and B32b. The multiplication step is now done as follows:

       (9-6)

Since we have now doubled the size of the matrix compared to that of the mesh of processors, each computation step of the algorithm must update four separate terms, each one with the sum of two products:

       (9-7)

Indexing requires a close inspection of the properties of the data domain, of the network of processors, and an acute understanding of the data communication patterns. This example illustrates well how a simple problem may sometimes hide complex indexing issues.

The code of a program implementing the matrix multiplication is given below.




[Previous] [HOME] [NEXT]