![]() |
Step 1. A22 and B22 are already in the memory of T22, so the computation can start right away. In C, this yields: C22 = A22*B22; |
![]() |
Step 2. The transputers in the second column rotate their copies of the B elements. Then, the transputers in the second row rotate their copies of the A elements1. T22 now contains A21 and B12, which it can multiply and add to its running sum: C22 += A21*B12; |
![]() |
Step 3. After one more rotation of the B copies among the transputers in the second column, and a rotation of the A copies in the second row, T22 contains A24 and B42, which it can multiply and add to its running sum c22 += A24*B42; |
![]() |
Step 4. Final rotations column-wise and row-wise. T22 receives A23 and B32, which it can multiply together to form the final product to be added to C22. C22 += A23*B32; |
At the end of Step 4, T22 contains C22, which, indeed, contains the sum described in Eq. 9-?.
![]() | T23 starts with A23 and B23 in memory, but they can't be combined since the second index of A does not match the first index of B. An horizontal rotation must take place before the first product can be computed. If we had looked at T24, we would have found that two horizontal rotations would been necessary before the computation could start. |
The solution is to stagger the rows and columns of the A and B matrices in the mesh, so that each processor can start the computation before any rotation is performed, and to make sure computation can be done at each node between each pair of row and column rotations.

(a)

(b)
(c)
(d)
(e)
The first step is to perform the assignment for the first row. The assignment of the Cij elements is trivial: we match the i and j indexes of the transputer. Then we assign the elements of A: A11, A12, A13, and A14. Here we simply follow the indexing used for Cij elements (Figure 9-1a). Once the elements of C and A are on the first row, the assignment of B elements must be done so that each processor contains a triplet of the form Cij, Aik, Bkj (Figure 9-1b). The same technique is used for the first column, this time assigning C and B elements first (Figures 9-1c and 9-1d). Finally, the remaining elements of the matrices are assigned so that each row contains a rotation of a row of the original A matrix, and each column a rotation of a column of the original B matrix.
We have now defined the indexing scheme required to assign elements of the C, A, and B matrices to the array of transputers for the matrix multiplication.
(9-5)
where N is the size of the matrix and the size of the toroidal mesh.
For example, assume that the matrices are now of size 8 by 8, but that the transputer network remains a 4 by 4 toroidal mesh. Each transputer now contains a 2 by 2 sub-block of each matrix. We shall denote a sub-blocks with a b superscript, as in Cijb. Transputer T22, at the beginning of the multiplication, will contain C22b, A23b and B32b. The multiplication step is now done as follows:
(9-6)
Since we have now doubled the size of the matrix compared to that of the mesh of processors, each computation step of the algorithm must update four separate terms, each one with the sum of two products:
(9-7)
Indexing requires a close inspection of the properties of the data domain, of the network of processors, and an acute understanding of the data communication patterns. This example illustrates well how a simple problem may sometimes hide complex indexing issues.
The code of a program implementing the matrix multiplication is given below.