As I mention here, I’ve seen Intel TBB’s custom STL allocator significantly improve the performance of a multithreaded app simply by changing a single
std::vector<T>
to
std::vector<T,tbb::scalable_allocator<T> >
(This is a quick and convenient way of switching the allocator to use TBB’s nifty thread-private heaps, which let threads allocate without contending on one shared heap lock; see page 7 in this document.)
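To make the one-line swap concrete, here is a minimal sketch of how it could be wired behind a compile-time switch so the rest of the code never names the allocator directly. `tbb::scalable_allocator` and its header are real TBB names; the `USE_TBB_ALLOCATOR` macro, the `Alloc` and `Vec` aliases, and `fill_and_sum` are hypothetical names I made up for illustration. With the macro defined you would need TBB installed and would typically link against `tbbmalloc`; without it, the code falls back to `std::allocator`.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// USE_TBB_ALLOCATOR is a hypothetical build flag for this sketch;
// TBB itself does not define it.
#ifdef USE_TBB_ALLOCATOR
#include <tbb/scalable_allocator.h>  // typically requires linking with -ltbbmalloc
template <typename T>
using Alloc = tbb::scalable_allocator<T>;
#else
template <typename T>
using Alloc = std::allocator<T>;     // default allocator as a fallback
#endif

// The only place the allocator choice is visible to the rest of the code.
template <typename T>
using Vec = std::vector<T, Alloc<T>>;

// Stand-in for per-thread work (hypothetical): grows a vector, forcing
// repeated reallocations, then sums the elements.
long long fill_and_sum(std::size_t n) {
    Vec<long long> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<long long>(i));
    }
    return std::accumulate(v.begin(), v.end(), 0LL);
}
```

Calling `fill_and_sum` from several threads at once is where the two builds would be expected to differ; whether you see a speedup depends on how allocation-heavy your workload is, so measure both variants before committing to either.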