Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!decwrl!megatest!djones
From: djones@megatest.UUCP (Dave Jones)
Newsgroups: comp.lang.c
Subject: Re: Perfect HASH functions.....
Message-ID: <8306@goofy.megatest.UUCP>
Date: 26 Sep 89 23:11:27 GMT
References: <1989Sep24.214153.8867@rpi.edu>
Organization: Megatest Corporation, San Jose, Ca
Lines: 147

From article <1989Sep24.214153.8867@rpi.edu>, by flynn@clotho.acm.rpi.edu (Kevin Lincoln Flynn):
> As long as people are talking hashing, can anyone think of a GOOD way to hash
> 16-bit numbers determined dynamically at runtime into a fast way to look them
> up?  Specifically, I'm talking about the attribute dictionary of an object in
> an OO system we're working on here -- the tags of attributes are 16-bit handles
> about which nothing is known at runtime.  Someone suggested closed hashing,
> with the number of buckets growing by roughly doubling primes.... ie start
> with two buckets, then 3, then 7, then 17, etc..... any ideas out there?
> (oh yeah, the hash function as well as the strategy'd be nice.  [ :) ])

Reluctant as I am to contribute to your delinquency by aiding and abetting
dynamic attribute binding... well, okay. :-|

For a hash-function, it depends on the form of those 16-bit numbers. If all
the bits are more or less equally important, you may just want to multiply
by a peculiar number. Try 9821. Read Sedgewick's "Algorithms" or something
by Knuth on pseudo-random number generation. Do some experiments to see what
kind of collision rates you get with different numbers and different
table-sizes. Numbers that end in x21, where x is even, may avoid some
problems. I don't pretend to understand this.

Now then. Concerning the strategy....

I have never been convinced that it is advantageous to use prime numbers for
sizing hash tables. Nobody has ever been able to do much more by way of
convincing than to wave books at me. But when you read those books, they
just say it's a Good Thing.
No analysis of why it is a Good Thing. My reasoning is that if the
hash-function is good, then the algebraic properties of the size of the
table don't matter. Random squared is random. Or in this case,
pseudo-random squared is pseudo-random.

Why am I telling you all this? Well, even on some modern machines, dividing
is expensive. I just ran a little test on my Sun 3/60, and it did 10,000,000
bitwise "&" operations five and a half times as fast as it did the same
number of divide operations.

I have one hash-table package that I use almost exclusively, and it seems to
give very good results. It keeps tables whose sizes are powers of two, so to
mod out the size of the table, it masks the hash-value with one & operation.
I keep the table strictly less than half full, in order to keep the number
of collisions low.

The less-than-half-full property allows me to use a rather peculiar
rehashing scheme that I dreamed up. It is based on a very remarkable
property that I discovered, I don't know how. Too much coffee, probably.
The property is that if M is a power of two, then the sequence

    S(0)   = 0
    S(n+1) = ((S(n)+1)*3) % M

repeats after exactly M/2 steps. (The same goes for the analogous sequence
started from any bucket, so every probe sequence visits M/2 distinct slots.)
Since the "% M" can be done with an & operation, and the "*3" can be done
with a shift and an add, this is a fast rehash to calculate. And the
algorithm does not have to check whether it has cycled, because by keeping
the table less than half full, we have guaranteed that the sequence will
land on an empty slot.

I've got a routine that puts something in the table if nothing equivalent
was already there, or else returns the equivalent thing; a routine that
replaces the equivalent thing and returns the old value; a routine that just
looks for something; etc., etc. I even have a routine that removes entries.
The average time (which is what you should see always, if the hash-function
is good) is O(1) for both insertion and removal. Worst case for insertion is
O(n) and for removal is O(n**2), but that doesn't matter, of course.

Here's a sample.
(The way the bit-mask for modding out the power of two is calculated assumes
a two's complement machine. I don't ever expect to use the other kind.)

And oh yes. Please forgive the dynamic binding of the "hash" and "eq"
attributes. In the OO attribute looker-upper you will no doubt want to do
the right thing and expand the hash and equivalence functions in line.

#define HASH(cont)   (((*(obj->hash))(cont)) & obj->mask)
#define REHASH(num)  ((((num)+1)*3) & obj->mask)

static Ptr* new_table();
static int  round_up_to_pwr2();
static void Hash_overflow();

Hash*
Hash_init_sized(obj, hash, eq, initial_table_size)
    register Hash* obj;
    int (*hash)();
    Bool (*eq)();
    int initial_table_size;
{
    if (initial_table_size < 4)
        initial_table_size = 4;
    obj->size = round_up_to_pwr2(initial_table_size);
    obj->num_entries = 0;
    obj->max_entries = obj->size/2 - 1;
    obj->mask = obj->size - 1;
    obj->hash_table = new_table(obj->size);
    obj->hash = hash;
    obj->eq = eq;
    return obj;
}

/* Put a new entry into the table. If there was previously an
 * equivalent entry in the table, return it.
 * If there was not, return (Ptr)0.
 */
Ptr
Hash_put(obj, entry)
    register Hash* obj;
    Ptr entry;
{
    register int bucket_number;
    register Ptr* bucket;

    bucket_number = HASH(entry);

    while (1) {
        bucket = obj->hash_table + bucket_number;

        if (*bucket == (Ptr)0) {
            *bucket = entry;
            obj->num_entries++;
            if (obj->num_entries > obj->max_entries)
                Hash_overflow(obj);
            return (Ptr)0;        /* <======== added new entry */
        }

        if (!(*obj->eq)(entry, *bucket)) {
            bucket_number = REHASH(bucket_number);
            continue;             /* <====== search some more (collision) */
        }

        /* Found old Ptr. Replace. */
        {
            Ptr old = *bucket;
            *bucket = entry;
            return old;           /* <============== replaced entry */
        }
    }
}