globals and parfor

From the documentation on parfor: The body of a parfor-loop cannot contain global or persistent variable declarations. In the context of your problem, i.e., calling a function within the parfor that in turn references a global, this translates into: “parfor will probably not give expected or meaningful results”. This makes perfect sense. Consider the following …

Compiling code containing dynamic parallelism fails

You can do something like this:

    nvcc -arch=sm_35 -rdc=true simple1.cu -o simple1 -lcudadevrt

Or, if you have two files, simple1.cu and test.c, you can do the following. This is called separate compilation.

    nvcc -arch=sm_35 -dc simple1.cu
    nvcc -arch=sm_35 -dlink simple1.o -o link.o -lcudadevrt
    g++ -c test.c
    g++ link.o simple1.o test.o -o simple -L/usr/local/cuda/lib64/ …

Different execution policies at runtime

The standard approach here is to separate the selection of a type from the use of the type: the latter takes the form of a function template instantiated several times by the former non-template function (or function template with fewer template parameters). To avoid duplicating the normal parameters between these two layers, use a generic …

CUDA stream compaction algorithm

What you are asking for is a classic parallel algorithm called stream compaction. If Thrust is an option, you can simply use thrust::copy_if. It is a stable algorithm: it preserves the relative order of all elements. Rough sketch:

    #include <thrust/copy.h>

    template<typename T>
    struct is_non_zero {
        __host__ __device__
        auto operator()(T x) const -> bool {
            return x …
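
To make the idea behind stream compaction concrete without Thrust, here is a minimal CPU-side sketch in Java. It is not the Thrust implementation: it uses the common flag, prefix-sum, scatter formulation, and the names StreamCompaction and compactNonZero are hypothetical, chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class StreamCompaction {
    /** Keeps the non-zero entries of input, preserving their relative order. */
    static int[] compactNonZero(int[] input) {
        int n = input.length;
        if (n == 0) {
            return new int[0];
        }
        // Step 1: flag each element, 1 if it is kept and 0 if it is dropped.
        int[] scan = new int[n];
        Arrays.parallelSetAll(scan, i -> input[i] != 0 ? 1 : 0);
        // Step 2: inclusive prefix sum, so scan[i] counts kept elements in input[0..i].
        Arrays.parallelPrefix(scan, Integer::sum);
        // Step 3: scatter. Each kept element writes to its own slot scan[i] - 1,
        // so the parallel writes never collide and the original order is preserved.
        int[] out = new int[scan[n - 1]];
        IntStream.range(0, n).parallel()
                 .filter(i -> input[i] != 0)
                 .forEach(i -> out[scan[i] - 1] = input[i]);
        return out;
    }

    public static void main(String[] args) {
        int[] data = {0, 3, 0, 0, 7, 1, 0};
        System.out.println(Arrays.toString(compactNonZero(data)));
    }
}
```

The prefix sum is what makes the result stable: the output position of each kept element depends only on how many kept elements precede it.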

Visualization of Java Stream parallelization

The current Stream API implementation uses the collector's combiner to combine the intermediate results in exactly the same way as they were previously split. The splitting strategy depends on the source and on the common pool's parallelism level, but it does not depend on the exact reduction operation used (it is the same for reduce, collect, forEach, count, etc.). Relying on this …
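
One way to look at the split structure, offered as a rough sketch rather than a documented technique, is a collector whose combiner wraps its two inputs in brackets. Each leaf chunk is then a run of space-separated numbers, and every pair of brackets marks one combine step. The class SplitTree and the method trace are hypothetical names; the exact bracket pattern depends on the source and on the parallelism level of the machine, so no sample output is shown.

```java
import java.util.stream.Collector;
import java.util.stream.IntStream;

class SplitTree {
    /** Collects 0..n-1 into a string whose brackets mirror the combine steps. */
    static String trace(int n) {
        Collector<Integer, StringBuilder, String> tracing = Collector.of(
            // One container per leaf task.
            StringBuilder::new,
            // Within a leaf, append elements separated by spaces.
            (sb, x) -> { if (sb.length() > 0) sb.append(' '); sb.append(x); },
            // Combining two partial results records the step as "[left | right]".
            (left, right) -> new StringBuilder("[")
                .append(left).append(" | ").append(right).append("]"),
            StringBuilder::toString);
        return IntStream.range(0, n).boxed().parallel().collect(tracing);
    }

    public static void main(String[] args) {
        System.out.println(trace(16));
    }
}
```

Because the collector is ordered, removing the brackets and bars from the result always yields the elements in their original order, whatever shape the tree takes.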

Nested Java 8 parallel forEach loop performs poorly. Is this behavior expected?

The problem is that the rather limited parallelism you have configured is eaten up by the outer stream processing: if you say that you want eight threads and process a stream of more than eight items with parallel(), it will create eight worker threads and let them process items. Then within your consumer you are …
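
A small experiment can make the shared thread budget visible. The sketch below assumes the default common ForkJoinPool and records the names of the threads that execute the innermost work; NestedParallelDemo and run are hypothetical names. The set of thread names will vary between machines and runs, but in current OpenJDK builds it should stay close to the configured parallelism however deeply the streams are nested, because the inner streams do not get a pool of their own.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class NestedParallelDemo {
    // Names of all threads that executed the innermost body.
    static final Set<String> threads = ConcurrentHashMap.newKeySet();

    /** Runs an outer and an inner parallel loop; returns the number of (i, j) pairs processed. */
    static int run(int outer, int inner) {
        AtomicInteger processed = new AtomicInteger();
        IntStream.range(0, outer).parallel().forEach(i ->
            IntStream.range(0, inner).parallel().forEach(j -> {
                threads.add(Thread.currentThread().getName());
                processed.incrementAndGet();
            }));
        return processed.get();
    }

    public static void main(String[] args) {
        int pairs = run(16, 16);
        System.out.println("pairs processed: " + pairs);
        System.out.println("distinct threads: " + threads.size() + " " + threads);
    }
}
```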