22607

Linear probing huge sequences of keys with unequal hash

Question:

There is one thing about linear probing (hash tables) that is not intuitive to me. If I put key1 which hash results to array index 1. Then I put key2 -> array index 2. Then I put key3 -> again array index 1, this will go to array index 3. Then when I search for key3 I should go through indexes that contain keys that do not have the same hash as mine at all. Isn't this waste? If the sequence is really big and contains many keys (for example I have 20 elements, then null, for any key that results in array index from 0 to 20 I must go through all the indexes although they do not have the same hash as mine and I can eliminate this with separate chaining).

Or this is mitigated by the fact that our hashing function (if written well enough) distributes the keys equally among the indices and we constantly resize the array to be max half full?

Answer1:

Linear probing is sub-optimal when there are many collisions. Note that the number of collisions doesn't only depend on the hash, but also on the number of slots in the table (usually a prime number) because the index is the remainder of the integer division of the hash by the table length.

Note however, than having the colliding keys one next to the other might also take advantage of the CPU caches, which will bring from RAM many elements in one read. So, do not (in principle) think that the time it takes to check 20 probes is 20 times the time it takes to check one, because what happens inside the CPU and its caches is much faster than going to RAM. There is no magic though. If the computation of every comparison throws away what's in the caches, then part of the savings will be lost.

Answer2:

The issue you've identified is indeed something that can impact the performance of linear probing. When you try to look up some element, you may have to look pretty far from where the initial hash probe started in order to find your element.

That being said, linear probing is extremely fast in practice, and that's mostly due to locality of reference. The cost of looking up something in memory is not uniform - if you look up an address near something you've already read recently, chances are that memory region has been pulled into cache and the cost of looking things up is extremely low. As a result, the cost of those probes in practice is often less than you might naturally expect, since those probes are probably pretty quick.

However, this doesn't mean you can ignore this fact. There are a number of issues to keep an eye out for. First, as the load factor of the table increases, there comes a point where the cost of hitting other elements starts to make lookups take progressively longer and longer. Usually, you'll see people rehash into a bigger table at around a load factor of 75%. Second, you need to have a pretty good hash function, since if you have a low-quality hash that drops lots of elements into similar locations, you'll get really terrible performance for the reason you've mentioned.

There are a couple of techniques you can use to mitigate this. Robin Hood hashing works by moving around elements after they've been placed down so that elements that are closer to home get pushed further away to make room for elements that are closer to home. This makes the average cost of a lookup a bit higher, but dramatically reduces the worst case cost of a lookup (in other words, it reduces the variance of a lookup cost in exchange for increasing the expected value of that lookup cost). Hopscotch hashing works by capping the maximum distance that elements can be displaced and maintaining a bitmask indicating which elements nearby might be a match, reducing the amount of work that you need to do to find things. And the new Google flat_map starts with linear probing and uses really clever hashing and parallel memory operations to make lookups extremely fast.

Recommend

  • Can I link a 32-bit native dll (not .Net assembly) into my 64 bit .Net applicaiton?
  • Unrecognizable template declaration/definition
  • How to refresh a page after back bottun pressed
  • ld: warning: section __DATA/__objc_imageinfo__DATA has unexpectedly large size
  • Deserialize query string to JSON object
  • Delete to end of line after a match, keep lines not matched
  • Create a JSON document from NSObject with SwiftyJSON
  • opencv (emgucv) not working in unity in osx?
  • How to get triad census in undirected graph using networkx in python
  • Inheritance and lazy loading in NHibernate
  • Password_verify in PHP
  • how to refresh alertdialog in flutter
  • Merge objects on shared Key Value Pair with Lodash
  • Java encoding with Eclipse and Maven
  • TeamCity: Scripting elements jsp:declaration, jsp:expression, jsp:scriptlet are disallowed here
  • php using msaccess
  • Who should create view model instances in MvvmCross
  • jQuery/mobile Safari bug?
  • How to get .mpd file for a youtube video
  • Find a directory using wildcard in Inno Setup
  • Detect when MathJax has finished loading in UIWebView
  • getting the values of checkboxes in a checkboxlist control
  • Misplaced CAGradientLayer on iPhone 6s
  • Python sum values in tuple in a list in a dictionary?
  • How does the dispatcher work when mixing sync/async with serial/concurrent queue?
  • Shiny - change the size (padding?) of dropdown menu (select tags) smaller
  • How to find angle formed by the blades of a wind turbine with respect to a horizontal imaginary axis
  • playing mp3 from nsbundle
  • Set SelectedIndex of ListView in FlipView_SelectionChanged event
  • How to include associated objects using gon in Rails/jQuery
  • How can I ssh into a server that requires 2 password authentication using python's paramiko mod
  • Google App Engine backend servlet not responding
  • Background transfer download task failed when app was closed
  • Make checkout phone field optional for specific countries in WooCommerce
  • Excel VBA : conditional formatting of sheet1 cells from sheet2 values in excel 2007
  • XEP-0166: Jingle protocol implementation for voice/video chat in iOS