That is because memcpy copies through word pointers instead of byte pointers. In addition, memcpy implementations are often written with SIMD instructions, which makes it possible to move 128 bits at a time.
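SIMD instructions are assembly instructions that perform the same operation on every element of a vector. With 128-bit SSE registers that vector is 16 bytes long; newer extensions such as AVX use wider registers. The SIMD instruction set includes load and store instructions, so a single instruction can read or write all 16 bytes at once.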
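
To make the idea concrete, here is a minimal sketch of a 16-bytes-at-a-time copy loop. It is not how any particular libc implements memcpy; `copy_blocks16` is a hypothetical name, and the code assumes an x86 compiler that defines `__SSE2__` and provides `<emmintrin.h>`. On other targets it falls back to a plain 16-byte `memcpy` per block so that it still compiles.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128 */
#endif

/* Hypothetical helper (not a real libc function): copies n bytes from src
 * to dst, 16 bytes per iteration where possible. Like memcpy, the two
 * regions must not overlap. */
void copy_blocks16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Main loop: one 128-bit load and one 128-bit store per 16 bytes. */
    while (n >= 16) {
#if defined(__SSE2__)
        /* The "u" variants accept unaligned addresses. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
#else
        /* Assumed fallback for targets without SSE2. */
        memcpy(d, s, 16);
#endif
        d += 16;
        s += 16;
        n -= 16;
    }

    /* Tail: copy the remaining 0..15 bytes one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

A real memcpy typically does more than this, for example handling alignment and choosing different strategies for small and large sizes, but the main loop shows why moving 16 bytes per load/store pair beats a byte-by-byte loop.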