We have written recent research papers that survey the best schemes for this problem. Please see:
Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization,Software: Practice & Experience 45 (1), 2015.
http://arxiv.org/abs/1209.2137
Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience (to appear) http://arxiv.org/abs/1401.6399
They include an extensive experimental evaluation.
You can find a complete implementation of all techniques in C++11 online:
https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection
There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte
If you prefer Java, please see https://github.com/lemire/JavaFastPFOR