128 bit integer on cuda?

For best performance, one would want to map the 128-bit type on top of a suitable CUDA vector type, such as uint4, and implement the functionality using PTX inline assembly. The addition would look something like this: typedef uint4 my_uint128_t; __device__ my_uint128_t add_uint128 (my_uint128_t addend, my_uint128_t augend) { my_uint128_t res; asm (“add.cc.u32 %0, %4, %8;\n\t” … Read more