In this experiment I compared three matrix multiplication strategies. The first strategy pads the original matrices M and N with zero entries to construct matrices whose dimensions are multiples of TILE_WIDTH (I took TILE_WIDTH = 16), and then runs a kernel that performs the multiplication using shared memory. The second strategy applies no preprocessing to the input matrices; instead, the kernel includes conditional statements to handle matrices of arbitrary dimensions. This second strategy also uses shared memory. Finally, the third strategy does not use shared memory at all.
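A sketch of what the second (conditional, shared-memory) kernel might look like. This is my own illustration under the assumptions stated in the comments, not the author's actual code; the kernel name, parameter names, and row-major layout are hypothetical:

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiplication P = M * N for arbitrary dimensions
// (assumed row-major): M is m x k, N is k x n, P is m x n.
__global__ void matMulConditional(const float *M, const float *N, float *P,
                                  int m, int k, int n) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;
    // Loop over all tiles needed to cover the k dimension.
    for (int t = 0; t < (k + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
        int mCol = t * TILE_WIDTH + threadIdx.x;
        int nRow = t * TILE_WIDTH + threadIdx.y;
        // Guarded loads: threads that fall outside the matrices load zeros,
        // which is what makes arbitrary dimensions work without padding.
        Ms[threadIdx.y][threadIdx.x] =
            (row < m && mCol < k) ? M[row * k + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] =
            (nRow < k && col < n) ? N[nRow * n + col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            acc += Ms[threadIdx.y][i] * Ns[i][threadIdx.x];
        __syncthreads();
    }

    // Only threads inside the output matrix write a result.
    if (row < m && col < n)
        P[row * n + col] = acc;
}
```

The first (padding) strategy would be the same kernel with the boundary conditionals removed, since padding guarantees every tile is full.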
In the following figure I show the results of my implementation on a GTX 560 Ti:
In this experiment I compared the performance of my implementation for TILE_WIDTH values in the range [1, 32]. The following figure shows the results obtained for the second multiplication strategy (the conditional kernel using shared memory) with matrices M, N of dimensions 1024x1024. The device is again a GTX 560 Ti:
In your kernel implementation, how many threads can be simultaneously executing on a GeForce GTX 280 GPU?
I answer this question taking as reference my implementation of the conditional kernel that uses shared memory, with TILE_WIDTH = 16.
From the command nvcc --ptxas-options=-v, I obtained a memory usage of 11 registers (per thread), 2136 bytes of shared memory (per block), and 16 bytes of constant memory. I identified three factors which could restrict the number of threads simultaneously running on each SM:
First, the maximum number of threads that can simultaneously reside on an SM of the GTX 280 is 1024. This means that (number of threads per block) * (number of blocks resident on an SM) <= 1024. In my implementation I have blocks of 16x16 = 256 threads; therefore, I could have at most 4 blocks resident on the same SM.
Second, the shared memory of each SM in a GTX 280 is 16 KB. Then (amount of shared memory per block) * (number of blocks resident on an SM) <= 16384 bytes. In my implementation each block uses 2136 bytes of shared memory; therefore, following this condition (and ignoring the previous one), I could have at most 7 blocks resident.
Third, each SM in the GTX 280 contains 16384 registers. Then (number of registers per thread) * (number of threads resident on an SM) <= 16384. In my implementation each thread uses 11 registers; therefore, following this condition (and ignoring the previous two), I could have at most 1489 threads.
From these observations I conclude that the maximum number of threads that can simultaneously run on each SM, in my implementation, is 1024. The thread-per-SM limit is the active constraint, while the two memory restrictions are in this case passive constraints. Since the GTX 280 contains 30 SMs, the total number of threads that could run simultaneously on the device is 30 * 1024 = 30720.