In this experiment I compared three matrix multiplication strategies. The first strategy pads the original matrices M and N with zero entries to construct matrices whose dimensions are multiples of TILE_WIDTH (I took TILE_WIDTH = 16), and then runs a kernel that performs the multiplication using shared memory. The second strategy applies no preprocessing to the input matrices; instead, the kernel includes conditional statements to handle matrices of arbitrary dimensions. This second strategy also uses shared memory. Finally, the third strategy does not use shared memory at all.
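A sketch of what the second (conditional, shared-memory) kernel might look like. This is my own illustration under the assumptions stated in the comments, not the author's actual code; the kernel name, parameter names, and row-major layout are hypothetical:

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiplication P = M * N for arbitrary dimensions
// (assumed row-major): M is m x k, N is k x n, P is m x n.
__global__ void matMulConditional(const float *M, const float *N, float *P,
                                  int m, int k, int n) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;
    // Loop over all tiles needed to cover the k dimension.
    for (int t = 0; t < (k + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
        int mCol = t * TILE_WIDTH + threadIdx.x;
        int nRow = t * TILE_WIDTH + threadIdx.y;
        // Guarded loads: threads that fall outside the matrices load zeros,
        // which is what makes arbitrary dimensions work without padding.
        Ms[threadIdx.y][threadIdx.x] =
            (row < m && mCol < k) ? M[row * k + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] =
            (nRow < k && col < n) ? N[nRow * n + col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            acc += Ms[threadIdx.y][i] * Ns[i][threadIdx.x];
        __syncthreads();
    }

    // Only threads inside the output matrix write a result.
    if (row < m && col < n)
        P[row * n + col] = acc;
}
```

The first (padding) strategy would be the same kernel with the boundary conditionals removed, since padding guarantees every tile is full.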
In the following figure I show the results of my implementation on a GTX 560 Ti:
In this experiment I compared the performance of my implementation for TILE_WIDTH values in the range [1, 32]. The following figure shows the results obtained for the second multiplication strategy (the conditional kernel using shared memory) with matrices M, N of dimensions 1024x1024. The device is again a GTX 560 Ti:
In your kernel implementation, how many threads can be simultaneously executing on a GeForce GTX 280 GPU?
I answer this question taking as reference my implementation of the conditional kernel that uses shared memory, with TILE_WIDTH = 16.
From the command nvcc --ptxas-options=-v, I obtained a memory usage of 11 registers (per thread), 2136 bytes of shared memory (per block), and 16 bytes of constant memory. I identified three factors which could restrict the number of threads simultaneously running on each SM:
First, the maximum number of threads that can simultaneously reside on an SM of the GTX 280 is 1024. This means that (number of threads per block) * (number of blocks resident on an SM) <= 1024. In my implementation I have blocks of 16x16 = 256 threads; therefore, I could have at most 4 blocks resident on the same SM.
Second, the shared memory of each SM in a GTX 280 is 16 KB. Then (amount of shared memory per block) * (number of blocks resident on an SM) <= 16384 bytes. In my implementation each block uses 2136 bytes of shared memory; therefore, following this condition (and ignoring the previous one), I could have at most 7 blocks resident.
Third, each SM in the GTX 280 contains 16384 registers. Then (number of registers per thread) * (number of threads resident on an SM) <= 16384. In my implementation each thread uses 11 registers; therefore, following this condition (and ignoring the previous two), I could have at most 1489 threads.
From these observations I conclude that the maximum number of threads that can simultaneously run on each SM, in my implementation, is 1024. The thread-per-SM limit is the active constraint, while the two memory restrictions are in this case passive constraints. Since the GTX 280 contains 30 SMs, the total number of threads that could run simultaneously on the device is 30 * 1024 = 30720.