Assignment 1: Simple Matrix Multiplication
Fabian Prada
1) Calling the Device multiplication function from the Host:
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P){
    // Allocate device copies of the three matrices.
    Matrix Mdevice = AllocateDeviceMatrix(M);
    Matrix Ndevice = AllocateDeviceMatrix(N);
    Matrix Pdevice = AllocateDeviceMatrix(P);
    // Copy the inputs to the device.
    CopyToDeviceMatrix(Mdevice, M);
    CopyToDeviceMatrix(Ndevice, N);
    // Launch a single block with one thread per output element.
    dim3 dimGrid(1, 1);
    dim3 dimBlock(MATRIX_SIZE, MATRIX_SIZE);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Mdevice, Ndevice, Pdevice);
    // Copy the result back and release device memory.
    CopyFromDeviceMatrix(P, Pdevice);
    cudaFree(Mdevice.elements);
    cudaFree(Ndevice.elements);
    cudaFree(Pdevice.elements);
}
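The host function above relies on a Matrix struct and helper routines provided by the course skeleton. A minimal sketch of what those helpers might look like is given below; the field names width, height, and elements are assumptions inferred from how they are used in the code above, not the skeleton's actual definitions.

```cuda
// Hypothetical sketch of the Matrix struct and helpers assumed above.
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Allocate a device matrix with the same dimensions as M.
Matrix AllocateDeviceMatrix(const Matrix M) {
    Matrix Mdevice = M;
    size_t size = (size_t)M.width * M.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}

// Copy a host matrix into a previously allocated device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost) {
    size_t size = (size_t)Mhost.width * Mhost.height * sizeof(float);
    cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
}

// Copy a device matrix back into a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice) {
    size_t size = (size_t)Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
}
```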
2) Device multiplication function:
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P){
    // Thread (tx, ty) computes the output element at row tx, column ty.
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    float cumulativeVal = 0.f;
    // Dot product of row tx of M with column ty of N.
    for (int k = 0; k < MATRIX_SIZE; k++) {
        cumulativeVal += M.elements[MATRIX_SIZE * tx + k] * N.elements[ty + MATRIX_SIZE * k];
    }
    P.elements[tx * MATRIX_SIZE + ty] = cumulativeVal;
}
3) Device performance:
- How many times is each element of the input matrices
loaded during the execution of the kernel?
Each element of the 16x16 input matrices, M and N, is loaded 16 times during the execution of the kernel. In my implementation, thread (tx,ty) computes the output value of matrix P located at row tx and column ty. To perform this computation, the thread must access global memory and load 16 values from each matrix: M_(tx,1), M_(tx,2), ..., M_(tx,16), and N_(1,ty), N_(2,ty), ..., N_(16,ty). Therefore an element of matrix M, say M_(a,b), is loaded from global memory by all the threads that compute elements in row a of matrix P, namely the threads with indices (a,1), (a,2), ..., (a,16). Analogously, an element N_(c,d) is loaded from global memory by all the threads that compute elements in column d of matrix P, namely the threads with indices (1,d), (2,d), ..., (16,d).
In the case of nxn matrix multiplication, each element of the input matrices is loaded n times.
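The n-times count can be checked by replaying the kernel's access pattern on the host. The helper below (an illustration, not part of the assignment code) mirrors the index arithmetic of MatrixMulKernel for nxn inputs and tallies how often each input element is read:

```c
#include <stdlib.h>

/* Simulate the kernel's global-memory reads for n x n inputs and return
   how many times each input element is loaded (-1 if not uniform). */
int loads_per_element(int n) {
    int *loadsM = calloc((size_t)n * n, sizeof(int));
    int *loadsN = calloc((size_t)n * n, sizeof(int));
    /* One pass of the k-loop per thread (tx, ty). */
    for (int tx = 0; tx < n; tx++)
        for (int ty = 0; ty < n; ty++)
            for (int k = 0; k < n; k++) {
                loadsM[n * tx + k]++;   /* M index used by the kernel */
                loadsN[ty + n * k]++;   /* N index used by the kernel */
            }
    /* Check that every element of M and N was loaded the same number of times. */
    int count = loadsM[0], uniform = 1;
    for (int i = 0; i < n * n; i++)
        if (loadsM[i] != count || loadsN[i] != count)
            uniform = 0;
    free(loadsM);
    free(loadsN);
    return uniform ? count : -1;
}
```

For n = 16 this returns 16, matching the count argued above.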
- What is the memory-access to floating-point computation
ratio in each thread?
The thread (tx,ty) computes the value of P_(tx,ty). This thread performs 32 memory accesses to load the values M_(tx,1), M_(tx,2), ..., M_(tx,16), and N_(1,ty), N_(2,ty), ..., N_(16,ty). During the computation it performs 16 multiplications (M_(tx,1)*N_(1,ty), M_(tx,2)*N_(2,ty), ..., M_(tx,16)*N_(16,ty)) and 16 additions (0 + M_(tx,1)*N_(1,ty) + M_(tx,2)*N_(2,ty) + ... + M_(tx,16)*N_(16,ty)). Therefore, the total numbers of memory accesses and floating-point operations per thread are both 32, and the memory-access to floating-point computation ratio in each thread is 1.
This also holds for multiplication of nxn matrices: each thread performs 2n memory accesses and 2n floating-point operations, so the ratio remains 1.
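The per-thread bookkeeping can be written out explicitly for general n. The helper below (an illustration, not part of the assignment code) counts one thread's memory accesses and floating-point operations per iteration of the kernel's k-loop:

```c
/* Per-thread cost model for the naive kernel on n x n matrices:
   each k-loop iteration loads one element of M and one of N (2 accesses)
   and performs one multiply and one add (2 flops). */
typedef struct {
    int mem_accesses;
    int flops;
} ThreadCost;

ThreadCost naive_thread_cost(int n) {
    ThreadCost c = {0, 0};
    for (int k = 0; k < n; k++) {
        c.mem_accesses += 2; /* load M_(tx,k) and N_(k,ty) */
        c.flops += 2;        /* one multiply, one add */
    }
    return c;
}
```

For n = 16 this gives 32 accesses and 32 flops, i.e. a ratio of 1, as stated above.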