Dictionary Hash Function

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

User avatar
Kow
Posts: 195
Joined: Sat Apr 19, 2008 12:37 pm UTC

Dictionary Hash Function

Postby Kow » Thu Dec 10, 2009 11:59 am UTC

Recently I've been assigned a project to make a dictionary hash table (among other things). We're provided the dictionary file and are required to load it in using a hash function of our own design. The catch, however, is that we need at most 5% of the buckets in the table to be empty. I managed to make one based off character values and such, but it only gets it barely to 5%. I highly suspect that there are more efficient algorithms existing and was wondering if any of you guys had any ideas, knew of any, or even could just direct me to the right path of logic for it.
Image

User avatar
Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway
Contact:

Re: Dictionary Hash Function

Postby Berengal » Thu Dec 10, 2009 8:05 pm UTC

First thing I'd do is find the distribution of hashes the hash function creates for the dictionary. If that's fine, you'll have to take a look at the hash table implementation itself. If not, screw around with it. You want to create a function that cares about the order of characters, not just the characters themselves, at least, and it's good to throw primes or semi-primes into the mix. 33 is a good constant for strings.
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.

User avatar
Kow
Posts: 195
Joined: Sat Apr 19, 2008 12:37 pm UTC

Re: Dictionary Hash Function

Postby Kow » Thu Dec 10, 2009 9:19 pm UTC

Berengal wrote:First thing I'd do is find the distribution of hashes the hash function creates for the dictionary. If that's fine, you'll have to take a look at the hash table implementation itself. If not, screw around with it. You want to create a function that cares about the order of characters, not just the characters themselves, at least, and it's good to throw primes or semi-primes into the mix. 33 is a good constant for strings.


Could you elaborate on this a little more? I don't seem to get what you mean by order of characters. Well, I know what it means literally, but how's that apply to making it a good hash function?

I manged to find one that is completely filled, but the spread is horribly uneven. Minimum bucket size of 1, max of 127, average being 33.4. What techniques are used to make a more uniform distribution?
Image

User avatar
MHD
Posts: 630
Joined: Fri Mar 20, 2009 8:21 pm UTC
Location: Denmark

Re: Dictionary Hash Function

Postby MHD » Thu Dec 10, 2009 9:27 pm UTC

Look into cryptographic algoritms, you might get some inspiration from those.
EvanED wrote:be aware that when most people say "regular expression" they really mean "something that is almost, but not quite, entirely unlike a regular expression"

Unparallelogram
Posts: 336
Joined: Mon Jul 28, 2008 4:16 am UTC

Re: Dictionary Hash Function

Postby Unparallelogram » Thu Dec 10, 2009 10:25 pm UTC

So if you take a character and convert it to an int, for example, you'll get some bits set among the least significant few. What you can do is make a combining function that distributes this entropy among all the bits of your output, and does so that later characters it receives are mixed differently from earlier ones. An example would be exclusive-or-ing the current character into your output, then shifting it so each character hits a slightly different spot.

User avatar
Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway
Contact:

Re: Dictionary Hash Function

Postby Berengal » Thu Dec 10, 2009 10:34 pm UTC

What it all comes down to is just some random bitshifting and whatnot. It's theoretically possible to construct perfect hashing functions, but that requires knowledge of all possible values it can be fed and their distribution. You don't have that information though, and simply assuming uniform distribution of characters and string lengths won't make the best hash function for strings of english words. This leaves you with the random fiddling to try to get a good hash function. All the string hashing functions I've seen have had some random constants thrown in, and each time it was commented it was something like "we don't know why this constant works better than others. We just typed some random number in and got a really good distribution out of it in the end." Most of the time they're prime or semi-prime, or at least have few prime factors.
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.

Unparallelogram
Posts: 336
Joined: Mon Jul 28, 2008 4:16 am UTC

Re: Dictionary Hash Function

Postby Unparallelogram » Thu Dec 10, 2009 10:48 pm UTC

Depending on how long your strings are, you might need to be careful about cycling over stuff. Just xor-ing in stuff, especially if your rotations cycle soon, might zero out a lot more than you want. Primes are kinda nice cause they give you long cycles, for one thing.

User avatar
Kow
Posts: 195
Joined: Sat Apr 19, 2008 12:37 pm UTC

Re: Dictionary Hash Function

Postby Kow » Thu Dec 10, 2009 11:38 pm UTC

The range of string length are 3 <= string length <= 6.

I currently have:

Code: Select all

   int index = 1;

   for(int i=0; i<v.size();i ++)
   {
      index ^= 173 ^ ((v[i]*91) << i);
   }
   return index%BUCKET_SIZE;


Where BUCKET_SIZE is #defined currently at 2371 and v being the string/word. This gives me the largest bucket: 18, average size: 7.5

I'm at 0% empty though, so that's good. I'm shooting for a sub 10 for max size if possible. Is there any way I can twerk this with some sort of logic behind it, or is the only way I could get a better spread would be just tinkering with numbers?
Image

Unparallelogram
Posts: 336
Joined: Mon Jul 28, 2008 4:16 am UTC

Re: Dictionary Hash Function

Postby Unparallelogram » Fri Dec 11, 2009 12:13 am UTC

You're xor-ing 173 in and out of index repeatedly. It's not giving you anything extra.
It might be easier to rotate index between iterations, rather than everything else around it.
If you only left shift, you're shifting entropy out of low order bits. The lowest few bits of index are influenced only by the first few entries.
If you always multiply by 91, you may or may not be hurting the low order bits.

User avatar
Kow
Posts: 195
Joined: Sat Apr 19, 2008 12:37 pm UTC

Re: Dictionary Hash Function

Postby Kow » Fri Dec 11, 2009 12:17 am UTC

Rotate index?

Edit: Oh, like bit wrapping? 0000 0001 >> 1 would yield 1000 0000 not 0000 0000?

Uhh. I don't think that's what you mean, but to reduce an uneven distribution of entropy might benefit from that making it so I can use only one type of bitshift.
Image

Unparallelogram
Posts: 336
Joined: Mon Jul 28, 2008 4:16 am UTC

Re: Dictionary Hash Function

Postby Unparallelogram » Fri Dec 11, 2009 2:04 am UTC

No. If you shift bits off the end, you lose them forever.
To rotate, you shift in both directions, losing halves, and then xor the parts together.
(x >>> n) ^ (x << (w - n))
rotates x right by n bits on a word size w integer.

My main point is, either you have to move index around, or add stuff to it in a more irregular way. Some bits of index currently contain a lot more information than others, and some almost none at all.

Your current system is xor-ing into index values that are increasingly filled with zeroes in their least significant bits.

User avatar
Kow
Posts: 195
Joined: Sat Apr 19, 2008 12:37 pm UTC

Re: Dictionary Hash Function

Postby Kow » Fri Dec 11, 2009 2:06 am UTC

Unparallelogram wrote:No. If you shift bits off the end, you lose them forever.

There are ways to get by that, but yea.

I'll mess around with that rotate. Thanks :D
Image

User avatar
PM 2Ring
Posts: 3715
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Dictionary Hash Function

Postby PM 2Ring » Fri Dec 11, 2009 3:47 pm UTC

Berengal wrote:What it all comes down to is just some random bitshifting and whatnot. It's theoretically possible to construct perfect hashing functions, but that requires knowledge of all possible values it can be fed and their distribution. You don't have that information though, and simply assuming uniform distribution of characters and string lengths won't make the best hash function for strings of english words. This leaves you with the random fiddling to try to get a good hash function.


In a real application, it is often the case that you can create a pool of typical or common values. This can be used to generate a perfect (or close to perfect) hash function that ought to perform well with the real data.

I have C code that generates perfect hash functions (hiding somewhere in my archives), but I will admit that it does get slow with large data sets. I never really used it for anything of significance, though. Just a simple test program that hashes the names of the chemical elements.

User avatar
Cosmologicon
Posts: 1806
Joined: Sat Nov 25, 2006 9:47 am UTC
Location: Cambridge MA USA
Contact:

Re: Dictionary Hash Function

Postby Cosmologicon » Fri Dec 11, 2009 6:44 pm UTC

Kow wrote:Recently I've been assigned a project to make a dictionary hash table (among other things). We're provided the dictionary file and are required to load it in using a hash function of our own design. The catch, however, is that we need at most 5% of the buckets in the table to be empty. I managed to make one based off character values and such, but it only gets it barely to 5%.

Just out of curiosity, what's the size of your dictionary file, and how many buckets are in your table? A theoretically perfect hash function will still leave some buckets empty on average. A quick simulation suggests that your if your dictionary is less than 3 times the size of your table, at least 5% of the buckets will be empty on average.

OOPS: I think that the factor in question is actually -ln(0.05) = 2.9957....

User avatar
Kow
Posts: 195
Joined: Sat Apr 19, 2008 12:37 pm UTC

Re: Dictionary Hash Function

Postby Kow » Sat Dec 12, 2009 8:03 am UTC

Cosmologicon wrote:
Kow wrote:Recently I've been assigned a project to make a dictionary hash table (among other things). We're provided the dictionary file and are required to load it in using a hash function of our own design. The catch, however, is that we need at most 5% of the buckets in the table to be empty. I managed to make one based off character values and such, but it only gets it barely to 5%.

Just out of curiosity, what's the size of your dictionary file, and how many buckets are in your table? A theoretically perfect hash function will still leave some buckets empty on average. A quick simulation suggests that your if your dictionary is less than 3 times the size of your table, at least 5% of the buckets will be empty on average.

OOPS: I think that the factor in question is actually -ln(0.05) = 2.9957....

The dictionary itself is 100,000 words large, but a number of them are unusable for the program's intention, so it's cut down to just over 20,000. I'm currently at 5,000 or so buckets at around .1% empty. The speed, even when searching through very large buckets, is pretty fast regardless, so I'm aiming for the less empty side rather than the more buckets side.
Image

User avatar
Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway
Contact:

Re: Dictionary Hash Function

Postby Berengal » Sat Dec 12, 2009 10:46 am UTC

I've got one question: How does it work on other datasets? Perhaps you've found an algorithm that's good on just this dictionary, but sucks at others.

It's something I'd be worried about at least, and your instructors might be using a different dictionary than the one you've been using when verifying the algorithm works as expected.
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.

User avatar
Cosmologicon
Posts: 1806
Joined: Sat Nov 25, 2006 9:47 am UTC
Location: Cambridge MA USA
Contact:

Re: Dictionary Hash Function

Postby Cosmologicon » Sat Dec 12, 2009 3:41 pm UTC

Kow wrote:The dictionary itself is 100,000 words large, but a number of them are unusable for the program's intention, so it's cut down to just over 20,000. I'm currently at 5,000 or so buckets at around .1% empty. The speed, even when searching through very large buckets, is pretty fast regardless, so I'm aiming for the less empty side rather than the more buckets side.

Yeah, I think this is really a cause for concern. If I'm getting this right, then a perfect hash function should only have about 1.8% empty buckets (exp(-4) = 0.018). If you're getting 0.1% empty and you're trying to cut that down, that means you've got an imperfect hash function and you're trying to make it worse. If you think I'm wrong about this, you could try running a hash function that's known to be good on your dictionary and seeing how many buckets wind up empty.


Return to “Coding”

Who is online

Users browsing this forum: No registered users and 10 guests