Reduce matrix rows with CUDA

Since you mentioned you need a general reduction algorithm rather than sum only, I will give three approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum but has good performance.

Kernel Approach

Here’s a very good doc introducing how to optimize standard parallel reduction. Standard reduction can be divided into two stages:

  1. Multiple thread blocks each reduces one part of the data;
  2. One thread block reduces the results of stage 1 to the final single element.

For your multi-reduction problem (reducing the rows of mat), stage 1 alone is enough. The idea is to reduce one row per thread block; see the sketch below. For further considerations like multiple rows per thread block or one row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve performance further, especially for matrices with awkward shapes.
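A minimal sketch of the one-row-per-block idea, assuming a row-major float matrix, a power-of-two block size, and sum as the reduction operator (the binary op and the sizes are placeholders you would adapt):

    // One thread block reduces one row of a row-major nrows x ncols matrix.
    __global__ void reduce_rows(const float *mat, float *out, int ncols)
    {
        extern __shared__ float sdata[];

        int row = blockIdx.x;      // one block per row
        int tid = threadIdx.x;

        // Each thread strides over its row and accumulates a partial sum.
        float acc = 0.0f;
        for (int col = tid; col < ncols; col += blockDim.x)
            acc += mat[row * ncols + col];
        sdata[tid] = acc;
        __syncthreads();

        // Tree reduction in shared memory (blockDim.x assumed a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            out[row] = sdata[0];
    }

    // Hypothetical launch: one block per row, 256 threads per block,
    // shared memory sized to the block.
    // reduce_rows<<<nrows, 256, 256 * sizeof(float)>>>(d_mat, d_out, ncols);

For a different reduction (max, min, product, ...), replace the `+=` accumulations with the desired operator and initialize `acc` with that operator's identity element.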

Thrust Approach

General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.

However, thrust::reduce_by_key does not assume each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It should give you a basic understanding of the performance.
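A minimal sketch of row sums with thrust::reduce_by_key, assuming row-major storage; the row index of each element is generated on the fly with a counting/transform iterator, so no explicit key array is stored (the sizes are placeholders):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/discard_iterator.h>
    #include <iostream>

    int main()
    {
        using namespace thrust::placeholders;

        const int nrows = 4, ncols = 5;                          // hypothetical sizes
        thrust::device_vector<float> mat(nrows * ncols, 1.0f);   // row-major matrix
        thrust::device_vector<float> row_sums(nrows);

        // Key of element i is its row index i / ncols.
        thrust::reduce_by_key(
            thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / ncols),
            thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / ncols) + nrows * ncols,
            mat.begin(),
            thrust::make_discard_iterator(),   // the row indices themselves are not needed
            row_sums.begin());

        for (int r = 0; r < nrows; ++r)
            std::cout << "row " << r << ": " << row_sums[r] << std::endl;
        return 0;
    }

Passing a different binary operator (e.g. thrust::maximum<float>()) as the last argument of reduce_by_key turns the same pattern into a row-wise max, which is what makes this approach general.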

cuBLAS Approach

The sum of rows/columns of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code.

y = A * ones(size(A,2),1);

where y is the vector of row sums of A.

The cuBLAS library provides a high-performance matrix-vector multiplication routine, cublas<t>gemv(), for this operation.
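A minimal sketch with cublasSgemv, assuming a row-major float matrix and a pre-filled ones vector on the device (the helper name and sizes are placeholders). cuBLAS expects column-major storage, so the row-major buffer is treated as the transposed matrix and CUBLAS_OP_T is used to recover y = A * ones:

    #include <cublas_v2.h>

    // d_A:    nrows x ncols matrix, row-major, on the device
    // d_ones: vector of ncols ones, on the device
    // d_y:    output vector of nrows row sums, on the device
    void sum_rows(cublasHandle_t handle, const float *d_A, const float *d_ones,
                  float *d_y, int nrows, int ncols)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // The row-major nrows x ncols buffer is a column-major ncols x nrows
        // matrix (i.e. A^T), so request the transposed product.
        cublasSgemv(handle, CUBLAS_OP_T,
                    ncols, nrows,      // dimensions of the column-major view
                    &alpha,
                    d_A, ncols,        // leading dimension = ncols
                    d_ones, 1,
                    &beta,
                    d_y, 1);
    }

For column sums, the same call with CUBLAS_OP_N and a ones vector of length nrows gives y = A^T * ones in the column-major view.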

Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance of this operation.
