CUDA stream compaction algorithm

What you are asking for is a classic parallel algorithm called stream compaction [1]. If Thrust is an option, you can simply use thrust::copy_if. It is a stable algorithm: it preserves the relative order of the elements it copies.

Rough sketch:

    #include <thrust/copy.h>

    template<typename T>
    struct is_non_zero {
        __host__ __device__
        auto operator()(T x) const -> bool {
            return x …
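For intuition about what a stream compaction does, here is a minimal sequential sketch in plain C++ of the flag / exclusive-scan / scatter scheme that parallel implementations are commonly built on. It is an illustration only, not a description of how thrust::copy_if is implemented; the name compact and its signature are hypothetical and do not belong to Thrust or any other library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Thrust API): sequential sketch of
// scan-based stream compaction. Steps 1 and 3 are independent per
// element and step 2 is a prefix sum, which is why the scheme
// parallelizes well.
template <typename T, typename Pred>
std::vector<T> compact(const std::vector<T>& in, Pred keep) {
    const std::size_t n = in.size();

    // Step 1: flag each element, 1 if it is kept, 0 otherwise.
    std::vector<std::size_t> flags(n);
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = keep(in[i]) ? 1 : 0;
    }

    // Step 2: exclusive prefix sum of the flags. For a kept element,
    // offsets[i] is its index in the output.
    std::vector<std::size_t> offsets(n);
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = total;
        total += flags[i];
    }

    // Step 3: scatter kept elements to their output positions.
    // Offsets increase with i, so relative order is preserved (stable).
    std::vector<T> out(total);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            out[offsets[i]] = in[i];
        }
    }
    return out;
}
```

Assuming the goal is to drop zeros, calling compact(v, [](int x) { return x != 0; }) on {0, 3, 0, 0, 7, 1, 0} yields {3, 7, 1}. On a GPU, the two per-element loops become one thread per element and the prefix sum becomes a parallel scan.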