Generalized sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in many applications from various domains such as machine learning, scientific computing, and graph analytics. Modern architectures (CPU and GPU) have been struggling to achieve decent performance for SpGEMM primarily due to irregularity and unpredictability of the memory accesses. In this work, we propose a custom accelerator for sparse matrix-matrix multiplication based on Field Programmable Gate Arrays (FPGAs).
We identify that the redundant accesses to the non-zero elements, the load imbalance in the processing elements, and the memory management of the partial products are the main performance bottlenecks in doing SpGEMM on FPGA. To alleviate these, we decouple the process of index matching and the multiplication to accomplish load balance. Furthermore, we exploit the FPGA on-chip memory to build a distributed data structure that holds the partial products in an order which helps the merge to be implemented efficiently.
Our accelerator can multiply large sparse matrices with any sparsity pattern. We evaluate our accelerator on a diverse set of real-world matrices show an average speedup of 3X over Intel Math Kernel Library on a CPU.