What you want to do is compile different object files for each instruction set you are targeting. Then create a cpu dispatcher which asks CPUID for the available instruction set and then jumps to the appropriate version of the function.
I already described this in several different questions and answers