Recent Intel/AMD CPUs have a POPCNT instruction, which seems like the logical way to do this. Would be interested to see how that performs compared to these implementations.
FWIW, in gcc/g++ there are compiler intrinsics which (should) map to that instruction on CPUs where it's available: __builtin_popcnt, __builtin_popcountl and __builtin_popcountll for unsigned ints, unsigned longs and unsigned long longs respectively. Visual C++ provides equivalent functions for Windows (but I can't remember what they're called).
It does seem odd that the article misses this approach out.
Unfortunately __builtin_popcnt isn't emitting a popcnt instruction with the GCC I've got here, even using -msse4.2. I believe that very recent GCC does get this right.
Also, is there any pure C code that an optimizing compiler could make into a POPCNT instruction? I would be quite impressed if a compiler could recognize some of the snippets in the article, and compile them into a POPCNT and a compare with 1.