What you are asking for is a classic parallel algorithm called stream compaction1.
If Thrust is an option, you may simply use thrust::copy_if
. This is a stable algorithm, it preserves relative order of all elements.
Rough sketch:
#include <thrust/copy.h>
template<typename T>
struct is_non_zero {
__host__ __device__
auto operator()(T x) const -> bool {
return x != 0;
}
};
// ... your input and output vectors here
thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());
If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It’s a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.
(1) Strictly speaking, it’s not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?