Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!megatest!djones
From: djones@megatest.UUCP (Dave Jones)
Newsgroups: comp.lang.c
Subject: hash algorithm
Message-ID: <723@goofy.megatest.UUCP>
Date: 18 Aug 88 02:18:47 GMT
References: <2550078@hpisod2.HP.COM>
Organization: Megatest Corporation, San Jose, Ca
Lines: 159

From article <2550078@hpisod2.HP.COM>, by decot@hpisod2.HP.COM (Dave Decot):
>> could anyone recommend a hashing algorithm to store a list of 10
>> digit phone numbers (area code plus 7 digits).
>> It should have little or no overflow and no collisions (as few as
>> possible). The list will contain more than one area code but
>> about 1000 numbers per area code (ie not a totally random sample
>> but one which has a relatively common prefix)
>> thank you very much.
>
> Treat the ten digits as an integer, n, and compute the index
> as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
>
> Store the number, and the record for that number, in the first empty
> hash table location at or after the index you've computed.
...

There are a couple of rather serious problems with the algorithm
suggested.

One is that, depending on HASH_SIZE, the high-order digits of the
phone numbers may not affect the hash at all. If HASH_SIZE is a round
number such as 10000, only the last four digits matter. Say, for
example, you store all of radio station KGO's talk numbers,
(415) 263-TALK, (408) 469-TALK, etc. (or whatever they are). They all
map to the same slot.

Secondly, the "first-empty-slot-after" algorithm tends to build up
clumps which the subsequent linear searches have to trace through.
Each such search adds to the size of the clump, and thus to the
likelihood that it will get hit next time.

I have devised a method for hashing which has not been published
previously. It is very, very fast. Here is the WORLD PREMIERE!!

First let's settle the business of how to hash integers into the
range 0..M-1. I suggest that you use

    hash(I) = (I*32821) % M

The number 32821 has some magic properties that spatter things about
nicely. See "Algorithms" by Sedgewick.
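
To make the two index computations concrete, here is a small sketch in
present-day C. It is only an illustration under stated assumptions: the
function names naive_index and mult_index are made up for this example,
and 64-bit unsigned arithmetic is assumed so that a ten-digit number
times 32821 cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* The quoted suggestion: the phone number itself, modulo the table size.
 * (Hypothetical helper name, not from the original posts.) */
uint64_t naive_index(uint64_t number, uint64_t table_size)
{
    assert(table_size > 0);
    return number % table_size;
}

/* The multiplicative form from the text: hash(I) = (I*32821) % M.
 * A ten-digit number times 32821 is about 3.3e14, well within 64 bits. */
uint64_t mult_index(uint64_t number, uint64_t m)
{
    assert(m > 0);
    return (number * 32821u) % m;
}
```

Writing TALK as the keypad digits 8255, and picking a table size of
10000 purely for illustration, naive_index sends both 4152638255 and
4084698255 to slot 8255. With M = 1024 (again an arbitrary choice),
mult_index puts the same two numbers in different slots.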

My hashing technique uses a table size M which is always a power of
two. This makes the modulo function (%) very fast on a two's complement
machine, if you are willing to accept a tiny bit of machine dependence.
(I never expect to use a machine that is not two's complement, so I'm
not too scared.) You code the mod function as

    H & mask

rather than

    H % table_size

where mask is table_size-1.

For a "rehash" function, rather than looking at the next slot, I use
the function defined by the following recursive definition:

    rehash(0, slot)   = slot
    rehash(i+1, slot) = ((rehash(i, slot)+1)*3) mod M

When M is a power of two (4 or larger), this function cycles after
precisely M/2 steps. I have proved as much, but my proof is tedious.
The range of the sequence is the set of numbers in 0..M-1 which are
congruent to slot or to 3*(slot+1) modulo 4. For example, with M = 8
and slot = 1 the sequence is 1, 6, 5, 2 and then back to 1. Don't ask
me how I discovered this; I think it came to me in a vision.

When the table gets half full, I double its size and hash everything
into the new table.

The routines allow for removing entries, as well as adding them and
looking them up, but I have omitted the removal routine, which is
quite tricky.

Here is an excerpt from the actual code:

(For purposes of the telephone number problem, we may assume that
"entry" is a record which contains a telephone number and some other
stuff, and that obj->eq is a pointer to a function that returns true
when two entries have the same tele-number in them and false
otherwise. obj->hash is a pointer to a function that multiplies the
tele-number in an entry by 32821.)

The following provably terminates because the rehash sequence visits
exactly half the slots before it repeats, and the table is kept less
than half full, so every such sequence contains at least one empty
slot.

/* CAVEAT: The following assumes that integers are two's complement.
*/

#define HASH(cont) (((*(obj->hash))(cont)) & obj->mask )
#define REHASH(num) (((((num)+1)*3) & obj->mask) )

Ptr
Hash_put(obj, entry )
     register Hash* obj;
     Ptr entry;
{
  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
        {
          *bucket = entry;
          obj->num_entries++;
          if ( obj->num_entries > obj->max_entries )
            Hash_overflow(obj);  /* double size of table */
          return (Ptr)0;  /* <======== added new entry */
        }

      if ( !obj->eq( entry, *bucket ) )
        {
          bucket_number = REHASH(bucket_number);
          continue;  /* <====== search some more (collision) */
        }

      /* Found old Ptr. Replace. */
      {
        Ptr old = *bucket;
        *bucket = entry;
        return old;  /* <============== replaced old entry */
      }
    }
}

Ptr
Hash_get(obj, entry )
     register Hash* obj;
     Ptr entry;
{
  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
        {
          return (Ptr)0;  /* <====== entry not found */
        }

      if ( !obj->eq( entry, *bucket) )
        {
          bucket_number = REHASH(bucket_number);
          continue;  /* <====== search some more (collision) */
        }

      return *bucket;  /* <====== found entry */
    }
}
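
The excerpt depends on a Hash type and a Hash_overflow routine that are
not shown, so it cannot be compiled by itself. For anyone who wants to
experiment, here is a self-contained sketch in the same spirit. It is
not the original code: PhoneTable, table_init, table_put,
table_contains and probe_cycle_length are names invented for this
example, keys are stored directly as 64-bit integers with 0 meaning
"empty slot", and the table is doubled just before it would become half
full. Unsigned arithmetic is used throughout, which sidesteps the two's
complement caveat.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table of phone numbers; a slot holding 0 is empty. */
typedef struct {
    uint64_t *slots;
    size_t size;   /* always a power of two, at least 4 */
    size_t count;  /* number of stored keys, kept below size/2 */
} PhoneTable;

static size_t first_slot(uint64_t key, size_t size)
{
    return (size_t)((key * 32821u) & (size - 1));
}

static size_t next_slot(size_t slot, size_t size)
{
    return ((slot + 1) * 3) & (size - 1);
}

/* Number of probes before the rehash sequence returns to its start. */
size_t probe_cycle_length(size_t slot, size_t size)
{
    size_t steps = 0, s = slot;
    do { s = next_slot(s, size); steps++; } while (s != slot);
    return steps;
}

/* Returns 1 on success, 0 if memory could not be allocated. */
int table_init(PhoneTable *t, size_t size)
{
    assert(size >= 4 && (size & (size - 1)) == 0);
    t->slots = calloc(size, sizeof *t->slots);
    t->size = size;
    t->count = 0;
    return t->slots != NULL;
}

/* Slot holding key, or the empty slot where key would be stored.
 * Terminates because count < size/2 and the cycle covers size/2 slots. */
static size_t find_slot(const PhoneTable *t, uint64_t key)
{
    size_t s = first_slot(key, t->size);
    while (t->slots[s] != 0 && t->slots[s] != key)
        s = next_slot(s, t->size);
    return s;
}

int table_contains(const PhoneTable *t, uint64_t key)
{
    return t->slots[find_slot(t, key)] == key;
}

/* Double the table and rehash every key; returns 0 on allocation failure,
 * in which case the old table is left untouched. */
static int table_grow(PhoneTable *t)
{
    PhoneTable bigger;
    size_t i;
    if (!table_init(&bigger, t->size * 2))
        return 0;
    for (i = 0; i < t->size; i++) {
        if (t->slots[i] != 0) {
            bigger.slots[find_slot(&bigger, t->slots[i])] = t->slots[i];
            bigger.count++;
        }
    }
    free(t->slots);
    *t = bigger;
    return 1;
}

/* Returns 1 if key was added, 0 if already present, -1 on failure. */
int table_put(PhoneTable *t, uint64_t key)
{
    assert(key != 0);
    if (table_contains(t, key))
        return 0;
    if ((t->count + 1) * 2 >= t->size && !table_grow(t))
        return -1;
    t->slots[find_slot(t, key)] = key;
    t->count++;
    return 1;
}
```

Growing before the insert, rather than after it as the excerpt does, is
a choice made for this sketch: a failed allocation then leaves a table
that is still less than half full, so later lookups still terminate.
probe_cycle_length is included only so that the M/2 cycle claim can be
checked by hand for small table sizes.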