Is there any way to allocate a standard Rust array directly on the heap, skipping the stack entirely?

Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap-allocated array.
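As a minimal sketch of that suggestion (the helper name zeroed_boxed_slice and the element type u32 are only illustrative, not from the question), a Vec can be built on the heap and then turned into a boxed slice:

```rust
/// Builds a heap-allocated, fixed-length slice of `len` zeros.
/// `zeroed_boxed_slice` is a hypothetical helper name used for illustration.
fn zeroed_boxed_slice(len: usize) -> Box<[u32]> {
    // `vec![0; len]` allocates its buffer on the heap; no array of
    // `len` elements is created on the stack first.
    let v: Vec<u32> = vec![0; len];
    // Drops the capacity field (and any excess capacity); what remains
    // is a pointer plus a length.
    v.into_boxed_slice()
}

fn main() {
    let mut data = zeroed_boxed_slice(1_000_000);
    data[42] = 7;
    println!("len = {}, data[42] = {}", data.len(), data[42]);
}
```

The result can be indexed and iterated like an array, but its length can no longer change.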

Unfortunately, your benchmarks are flawed. First of all, you probably didn’t compile with optimizations (--release for cargo, -O for rustc). If you had, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.
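One common way to make the result "observed" is std::hint::black_box. The following is only a sketch: fill_indices and the length 10_000 are made up for illustration, and black_box is documented as a best-effort hint, not a guarantee.

```rust
use std::hint::black_box;

/// Hypothetical workload: writes each index into its own slot.
fn fill_indices(len: usize) -> Vec<usize> {
    let mut v = vec![0usize; len];
    for i in 0..len {
        v[i] = i;
    }
    v
}

fn main() {
    // Hiding the length from the optimizer makes it harder to
    // precompute the whole result at compile time.
    let v = fill_indices(black_box(10_000));
    // Passing the result through `black_box` counts as a use, so the
    // work above is less likely to be removed as dead code.
    let v = black_box(v);
    println!("last element: {:?}", v.last());
}
```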

Also, your benchmark is not testing what you actually want to test. You are comparing a stack-allocated array with a heap-allocated vector. You should compare the Vec to a heap-allocated array.
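For reference, one way to get a heap-allocated array without building it on the stack first is to convert a boxed slice into a Box<[T; N]>. This is a sketch under assumptions: N, the element type u32 and the function name boxed_array are chosen for illustration and are not necessarily what the linked benchmark code does.

```rust
// Only needed on editions before 2021, where `TryInto` is not in the prelude.
use std::convert::TryInto;

// Illustrative length; not taken from the original benchmark.
const N: usize = 1_000_000;

/// Returns a zero-filled array of length `N` that lives on the heap.
fn boxed_array() -> Box<[u32; N]> {
    // The buffer is allocated on the heap by `vec!`.
    let slice: Box<[u32]> = vec![0u32; N].into_boxed_slice();
    // The conversion reuses the existing allocation and only succeeds
    // when the slice length equals `N`.
    let result: Result<Box<[u32; N]>, Box<[u32]>> = slice.try_into();
    match result {
        Ok(array) => array,
        Err(_) => unreachable!("the slice was created with exactly N elements"),
    }
}

fn main() {
    let mut a = boxed_array();
    a[N - 1] = 1;
    println!("len = {}, last = {}", a.len(), a[N - 1]);
}
```

By contrast, Box::new([0u32; N]) may construct the array on the stack before moving it, which can overflow the stack for large N, especially in unoptimized builds.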

Don’t feel bad though: writing benchmarks is incredibly hard for many reasons. Just remember: if you don’t know a lot about writing benchmarks, it’s better not to trust your own benchmarks without asking others first.

I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:

running 3 tests
test array_heap  ... bench:   9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench:   9,232,623 ns/iter (+/- 720,699)
test vec_heap    ... bench:   9,259,912 ns/iter (+/- 691,095)

Surprise: the differences between the versions are far smaller than the variance of the measurement. So we can assume they are all pretty much equally fast.

Note that this benchmark is still pretty bad. The two loops can just be replaced by one loop setting all array elements to LENGTH - 1. From taking a quick look at the assembly (and from the rather long time of 9ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important and one should be aware of that.
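To illustrate that remark, here is a sketch that assumes the benchmarked work has roughly the following shape (an outer loop over j, an inner loop storing j into every element). The benchmark code itself is not reproduced here, so fill_nested and fill_single are guesses at its structure, not the actual code.

```rust
/// Assumed shape of the benchmarked work: for each `j`, overwrite every
/// element with `j`. Only the last pass (j == len - 1) is visible afterwards.
fn fill_nested(a: &mut [i32]) {
    let len = a.len();
    for j in 0..len {
        for i in 0..len {
            a[i] = j as i32;
        }
    }
}

/// The single loop that produces the same final contents.
fn fill_single(a: &mut [i32]) {
    let len = a.len();
    for i in 0..len {
        a[i] = (len - 1) as i32;
    }
}

fn main() {
    // Smaller than a realistic benchmark size so the check runs quickly.
    const LENGTH: usize = 1_000;
    let mut nested = vec![0i32; LENGTH];
    let mut single = vec![0i32; LENGTH];
    fill_nested(&mut nested);
    fill_single(&mut single);
    assert_eq!(nested, single);
    println!("both versions leave every element at {}", LENGTH - 1);
}
```

An optimizer that recognized this equivalence would turn quadratic work into linear work, which would make the measured numbers meaningless.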

Finally, let’s discuss why both solutions should be equally fast and whether there are actually differences in speed.

The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching behavior (specifically, being very cache-friendly).

The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn’t need to store the length, as it’s already in the type, so it is just a simple pointer. Whether we store one, two or three words is not really important when you have millions of elements anyway.
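The word counts above can be checked with std::mem::size_of. In this sketch the element type u64, the array length 1024 and the helper name handle_sizes_in_words are arbitrary choices for illustration:

```rust
use std::mem::size_of;

/// Returns the size, in machine words, of the three handle types:
/// (Vec<u64>, Box<[u64]>, Box<[u64; 1024]>).
fn handle_sizes_in_words() -> (usize, usize, usize) {
    let word = size_of::<usize>();
    (
        size_of::<Vec<u64>>() / word,         // pointer, length, capacity
        size_of::<Box<[u64]>>() / word,       // pointer, length
        size_of::<Box<[u64; 1024]>>() / word, // pointer only
    )
}

fn main() {
    let (vec_words, slice_words, array_words) = handle_sizes_in_words();
    println!("Vec<u64>:         {} words", vec_words);
    println!("Box<[u64]>:       {} words", slice_words);
    println!("Box<[u64; 1024]>: {} words", array_words);
}
```

These sizes describe only the handle; the elements themselves occupy the same contiguous heap buffer in all three cases.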

Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it’s important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.
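The address calculation described above can be spelled out by hand. This is only an illustration: element_address is a hypothetical helper, and in real code the indexing operator does this work for you.

```rust
use std::mem::size_of;

/// Computes the address of `data[index]` manually: a bounds check
/// followed by base_pointer + size_of::<T>() * index.
fn element_address<T>(data: &[T], index: usize) -> usize {
    assert!(index < data.len(), "index out of bounds");
    data.as_ptr() as usize + size_of::<T>() * index
}

fn main() {
    let vec: Vec<u64> = vec![10, 20, 30, 40];
    let boxed_slice: Box<[u64]> = vec.clone().into_boxed_slice();
    let boxed_array: Box<[u64; 4]> = Box::new([10, 20, 30, 40]);

    // The manual calculation matches what indexing yields for all three.
    assert_eq!(element_address(&vec[..], 2), &vec[2] as *const u64 as usize);
    assert_eq!(
        element_address(&boxed_slice[..], 2),
        &boxed_slice[2] as *const u64 as usize
    );
    assert_eq!(
        element_address(&boxed_array[..], 2),
        &boxed_array[2] as *const u64 as usize
    );
    println!("manual address calculation matches indexing for all three");
}
```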

However, the optimizer already removes bounds checks in most cases. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster thanks to removed bounds checks, (a) this will be a minor performance difference and (b) it’s very unlikely that you will have a lot of situations where the bounds check is removed for the array but not for the Vec/slice.
