CUDA stream compaction algorithm

What you are asking for is a classic parallel algorithm called stream compaction1.

If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm, it preserves relative order of all elements.

Rough sketch:

#include <thrust/copy.h>

template<typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here

thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());

If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It’s a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.

(1) Strictly speaking, it’s not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?

Leave a Comment