Assignment 1: Simple Matrix Multiplication

Fabian Prada


1) Calling the Device multiplication function from the Host:

void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P){

    // Allocate device copies of the three matrices.
    Matrix Mdevice = AllocateDeviceMatrix(M);
    Matrix Ndevice = AllocateDeviceMatrix(N);
    Matrix Pdevice = AllocateDeviceMatrix(P);

    // Copy the input matrices to device memory.
    CopyToDeviceMatrix(Mdevice, M);
    CopyToDeviceMatrix(Ndevice, N);

    // One block of MATRIX_SIZE x MATRIX_SIZE threads: one thread per output element.
    dim3 dimGrid(1, 1);
    dim3 dimBlock(MATRIX_SIZE, MATRIX_SIZE);

    MatrixMulKernel<<<dimGrid, dimBlock>>>(Mdevice, Ndevice, Pdevice);

    // Copy the result back to the host and release device memory.
    CopyFromDeviceMatrix(P, Pdevice);

    cudaFree(Mdevice.elements);
    cudaFree(Ndevice.elements);
    cudaFree(Pdevice.elements);

}
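The device result can be validated against a plain CPU implementation. The sketch below assumes the matrices are square, row-major, and stored as flat float arrays of side MATRIX_SIZE = 16 (the `elements` field of the assignment's Matrix struct); the function name `MatrixMulOnHost` is hypothetical.

```c
#include <assert.h>

#define MATRIX_SIZE 16  /* assumed to match the assignment's constant */

/* CPU reference for the kernel launched above: P = M * N for square,
   row-major matrices stored as flat arrays. Useful for checking the
   result copied back from the device. */
void MatrixMulOnHost(const float *M, const float *N, float *P) {
    for (int row = 0; row < MATRIX_SIZE; row++)
        for (int col = 0; col < MATRIX_SIZE; col++) {
            float sum = 0.f;
            /* Dot product of row `row` of M with column `col` of N. */
            for (int k = 0; k < MATRIX_SIZE; k++)
                sum += M[row * MATRIX_SIZE + k] * N[k * MATRIX_SIZE + col];
            P[row * MATRIX_SIZE + col] = sum;
        }
}
```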

2) Device multiplication function:

__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P){

    // Thread (tx, ty) computes the element of P at row tx, column ty.
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float cumulativeVal = 0.f;

    // Dot product of row tx of M with column ty of N, read from global memory.
    for(int k = 0; k < MATRIX_SIZE; k++){

        cumulativeVal += M.elements[MATRIX_SIZE*tx + k] * N.elements[ty + MATRIX_SIZE*k];

    }

    P.elements[tx*MATRIX_SIZE + ty] = cumulativeVal;

}

3) Device performance:

  1. How many times is each element of the input matrices loaded during the execution of the kernel?

    Each element of the 16x16 input matrices, M and N, is loaded 16 times during the execution of the kernel. In my implementation, thread (tx,ty) computes the output value of matrix P located at row tx and column ty. To perform this calculation, the thread must access global memory and load 16 values from each matrix: M_(tx,1), M_(tx,2), ..., M_(tx,16), and N_(1,ty), N_(2,ty), ..., N_(16,ty). Therefore an element of matrix M, say M_(a,b), is loaded from global memory by all the threads that compute elements in row a of matrix P, namely the threads with indices (a,1), (a,2), ..., (a,16). Analogously, an element N_(c,d) is loaded from global memory by all the threads that compute elements in column d of matrix P, namely the threads with indices (1,d), (2,d), ..., (16,d).

    In the case of nxn matrix multiplication, each element of the input matrices is loaded n times.
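The counting argument above can be checked by replaying the kernel's access pattern on the host and tallying how often each input element is read (a sketch; the array names and the helper `count_loads` are illustrative, not part of the assignment code):

```c
#include <assert.h>

#define MATRIX_SIZE 16

/* loadsM[i][j] counts how many threads read M[i][j]; likewise loadsN. */
int loadsM[MATRIX_SIZE][MATRIX_SIZE];
int loadsN[MATRIX_SIZE][MATRIX_SIZE];

/* Replay the kernel's access pattern: thread (tx, ty) reads row tx of M
   and column ty of N, one element per loop iteration k. */
void count_loads(void) {
    for (int tx = 0; tx < MATRIX_SIZE; tx++)
        for (int ty = 0; ty < MATRIX_SIZE; ty++)
            for (int k = 0; k < MATRIX_SIZE; k++) {
                loadsM[tx][k] += 1;
                loadsN[k][ty] += 1;
            }
}
```

After `count_loads()`, every entry of both tally arrays equals MATRIX_SIZE, confirming that each input element is loaded 16 times.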

  2. What is the memory-access to floating-point computation ratio in each thread?

    Thread (tx,ty) computes the value of P_(tx,ty). This thread performs 32 memory accesses to load the values M_(tx,1), M_(tx,2), ..., M_(tx,16), and N_(1,ty), N_(2,ty), ..., N_(16,ty). During the computation it performs 16 multiplications (M_(tx,1)*N_(1,ty), M_(tx,2)*N_(2,ty), ..., M_(tx,16)*N_(16,ty)) and 16 additions (0 + M_(tx,1)*N_(1,ty) + M_(tx,2)*N_(2,ty) + ... + M_(tx,16)*N_(16,ty)). Therefore, the total numbers of memory accesses and floating-point operations per thread are both 32, so the memory-access to floating-point computation ratio in each thread is 1.

    The same holds for multiplication of nxn matrices: 2n memory accesses to 2n floating-point operations per thread, a ratio of 1.
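The general-n claim can be expressed as a small cost model (a sketch; the function names are illustrative):

```c
#include <assert.h>

/* Per-thread cost model for the naive kernel on n x n matrices:
   n loads from M plus n loads from N, and n multiplies plus n adds. */
int memory_accesses_per_thread(int n) { return 2 * n; }
int flops_per_thread(int n)           { return 2 * n; }

/* Ratio of global-memory accesses to floating-point operations. */
double access_to_flop_ratio(int n) {
    return (double)memory_accesses_per_thread(n) / (double)flops_per_thread(n);
}
```

For n = 16 this gives 32 accesses, 32 operations, and a ratio of 1, matching the count above.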