David W: the large bitset approach sounds good, but with bitset that size you are going to encounter cache misses on almost every lookup, which is going to end up being far slower than a naive approach. You can execute a *lot* of instructions in the time it takes to load an L2 cache line.