Robin Hood hashing

2013 November 11

As I am still doing research on open addressing hash table algorithms, I tested an approach called Robin Hood hashing. Just like for other reordering schemes, Robin Hood hashing will move entries that are already stored in the hash table as new items are inserted. In the original Robin Hood paper, it is claimed that Robin Hood hashing is very efficient due to the fact that the probe sequence length has a small and almost constant variance.

In this article, I am presenting how Robin Hood hashing handles insertion, deletion, and search, including a graphical representation for the insertion operation. I then use my own C++ implementation of Robin Hood hashing to see how the algorithm behaves under various load conditions and try to determine how efficient it really is. Finally, I discuss some of the drawbacks of Robin Hood hashing, and its applicability for the implementation of an on-disk key-value store.

robin-hood-hashing-more-web

1. Overview

Robin Hood hashing was introduced by Pedro Celis in his thesis in 1986 [1]. The idea is that entries are moved around based on how far they are from their initial buckets — the initial bucket being the bucket to which an entry is hashed. Robin Hood hashing takes buckets from entries that are closer to their initial buckets compared to the entries that need to be inserted. In that way, it takes from the riches and gives to the poor, which is where the name “Robin Hood hashing” originates.

Throughout this article, I will be referencing sections and figures from the original paper. When I do that, I will add “OP-” to the numeration of the sections and figures, in order to differentiate them from the references to the sections and figures of this article. For instance, Figure 1.2 from the original paper will be referred as Figure OP-1.2. In addition, the only copy of the original paper [1] that I was able to find is missing two pages, pp16-17. I believe that those pages did not hold any crucial information, and that it is possible to get a full understanding of Robin Hood hashing without them.

UPDATE 2014-03-15: The copy of the original paper [1] was updated and the two missing pages are now available.

In Section OP-1.2, the probe sequence is defined as: “the sequence of table entries to be inspected when inserting or searching for a record.” It derives from there that the probe sequence length (PSL) is the length of such a sequence, that is to say the number of entries to inspect when inserting or searching for a record. In Section OP-1.3, it is also mentioned that the PSL of a key is “equivalent to the probe position where the key was placed“. I find the notion of PSL and these definitions quite ambiguous, all the more so that later in the thesis are presented search algorithms that do not use linear probing, such as the organ pipe search and the smart search. With these search algorithms, an entry at a distance of K buckets from its initial bucket may be found by inspecting L entries, where L < K. Therefore, the notion of PSL can be misleading as it can refer to the distance of a bucket to its initial bucket, or to the number of entries that have to be tested during a search process.

To avoid ambiguity in the rest of this article, I will use the metric of distance to initial bucket (DIB), which is the number of buckets between the bucket where an entry is stored an its initial bucket, and the metric of probe for the number of entries that have to be inspected when an entry is searched or inserted.

Update 2013-11-17: I wrote a follow-up article on Robin Hood hashing presenting backward shift deletion, which outperforms Robin Hood hashing with tombstones along with basic linear probing [7].

2. Basic operations

2.1 Entry insertion

In order to insert an entry, its key is hashed to find the initial bucket. From there, a linear probing takes place, and for each bucket encountered, the DIB of the entry in that bucket is compared to the DIB of the entry to insert. If the DIB of the entry to insert is greater than the DIB of the entry in the current bucket, then the entries are swapped: the entry to insert is stored in the current bucket, and the entry that was originally in the current bucket become the entry to insert. This process continues until an empty bucket is found, in which the entry to insert can be stored without requiring any swapping. The DIB can be derived from the hashed representation of a key. In order to avoid recomputing the hashes for all the entries encountered during the linear probing, a solution is to store those hashes directly in the buckets of the table. The insertion process is presented with a schematic in Figure 1 below. For each step, bullet points on the right are providing more explanation as to what the algorithm is doing.

robin-hood-hashing-web

Figure 1: Insertion of an entry in a hash table using Robin Hood hashing

2.2 Entry retrieval

Entries can be found using linear probing starting from their initial buckets, until they are encountered, or until an empty bucket is found, in which case it can be concluded that the entry is not in the table. The search can also be stopped if during the linear probing, a bucket is encountered for which the distance to the initial bucket in the linear probing is smaller than the DIB of the entry it contains. Indeed, if the entry being searched were in the hash table, then it would have been inserted at that location, since the DIB of the entry in that location is smaller.

2.3 Entry deletion

Like in basic linear probing, deleting an entry in Robin Hood hashing cannot be performed just by removing the entry from the bucket array, because that would make the following entries in the same chain unreachable. Therefore, when deleting an entry, it has to be marked as deleted with a flag or a special value. Entries marked as deleted are also called tombstone entries. In addition, whatever was used to know its DIB — the key itself or the hashed key stored in the bucket — this has to be kept in the bucket. During insertions, deleted entries are treated as if they were not deleted, and are moved only if they would be moved if they had not been deleted. If a deleted entry is moved during an insertion, and becomes the entry to insert, it is simply discarded, and the insertion finishes.

Update 2013-11-17: There is a better alternative to using tombstones for deletion, it is to perform a backward shift [7].

2.4 Maintain the minimum and maximum DIB

In Robin Hood hashing, entries that are moved during an insertion process see their DIBs increasing. This means that overall, the DIBs are continuously increasing as entries are inserted. Since the average DIB grows without limit, the number of probes to insert an entry is also growing without limit. As inserting an entry requires to test buckets from a DIB of 0 to the average DIB, the cost of inserting could be reduced by maintaining the minimum and maximum DIB for the whole table, and limit the insertion process to the buckets within those boundaries. The same boundaries would be used to reduce the number of comparisons for the search operation. This is discussed in Section OP-6.2, and the insertion algorithm is presented in Figure OP-6.1, which is reproduced in Figure 2 below.

robin-hood-hashing-op6.1

Figure 2: Reproduction of Figure OP-6.1

 

Because the DIBs are constantly increasing, a clustering effect starts taking place, and entries start aggregating around the mean. As stated in Section OP-4.1:

As the load factor increases, the average probe position for a full table increases at a logarithmic rate, but the variance appears to remain bounded by a constant. This means that most of the probability mass is found around the mean.

Figure OP-4.1 illustrates this phenomenon, and is reproduced below in Figure 3. In that figure, “psl” is equivalent to what I call “DIB”, distance to initial bucket.

robin-hood-hashing-op4.1

Figure 3: Reproduction of Figure OP-4.1

 

3. Search algorithms

Maintaining counters to know the minimum and maximum DIBs as expressed in Section 2.4 above intends to minimize the number probes necessary to find an item. The original paper presents alternative approaches that fulfil the same goal, and allow to search for an entry by working around the mean.

3.1 Organ-pipe search

The idea behind organ-pipe search is, for each possible DIB value in the hash table, to count how many entries are stored at this DIB. This creates a distribution of the DIBs in the hash table. The best way to minimize the number of probes to find an entry is to sort the DIBs by decreasing counts, and to try positions in the hash table in that order, because in terms of probability, the DIBs with the highest counts are the positions at which the entry is the most likely to be found.

The drawback is obviously that the distribution of DIBs has to be maintained and sorted, and that any operation on the hash table will require to first access this distribution, maybe modify it, and then access the data associated with an entry. This is not very cache-friendly.

3.2 Smart search

Since the DIB with highest probabilities are found around the mean, an alternative to the organ-pipe search would be to search around the mean first and then at positions further from the mean. Let’s consider the average DIB, denoted as m. When trying to find an entry, smart search will consider the positions in the order m, m+1, m-1, m+2, m-2, etc., until all possible DIBs have been covered, from 0 to the maximum DIB.

This is not as probabilistically correct as the organ-pipe and its DIB distribution, but it’s close enough to the real distribution to minimize the number of probes. And in spite of this tiny drop in performance, the advantage of smart search over organ-pipe is that it does not require to maintain and access the distribution of DIBs. The mean has to be maintained during insertions, but it is simpler than maintaining the whole distribution of the DIBs.

Unfortunately, like for organ-pipe search, the non-sequential access pattern of smart search is not very cache-friendly.

4. Source code

In order to test the behavior of Robin Hood hashing, I have implemented the algorithm in C++. The source code is available on Github [2]. The code includes a Makefile which will make compilation easy. Calling the main program with the –help parameter will give you a detailed description of the available options, with examples at the end describing how to test a hash table using Robin Hood hashing.

For the implementation, I have chosen not to implement either organ-pipe search or smart search, but to keep a simple linear probing with minimum DIB correction as described in Section 2.4 above. This search algorithm requires to maintain the minimum and maximum DIBs, which I am doing by implementing the function presented in the original paper in Figure OP-6.1, presented in Figure 2 earlier, which itself derives from the function in Figure OP-2.1.

5. Experimental protocol

I have tested my implementation of Robin Hood hashing over three test cases. Here are full descriptions of the steps for each of the test cases:

“loading” test case:
- Insert entries in the hash table until it full, up to a load factor of 0.98

- Measure statistics at every 0.02 increment of the load factor

 

“batch” test case:
Uses two parameters, Load Factor Max (LFM) and Load Factor Remove (LFR)
- Insert entries in the table up to LFM (with a table of 10k entries and LFM=0.8, 8k entries would be inserted)
- Do the following operations over 50 iterations (for 1 <= i <= 50):

  • Remove at once LFR entries of the total capacity (with a table of 10k entries and LFR=0.1, 1k entries would be removed at once)
  • Insert at once LFR entries of the total capacity (with a table of 10k entries and LFR=0.1, 1k entries would be inserted at once)
  • Measure statistics at the i-th iteration

 

“ripple” test case:
Uses two parameters, Load Factor Max (LFM) and Load Factor Remove (LFR)
- Insert entries in the table up to LFM (with a table of 10k entries and LFM=0.8, 8k entries would be inserted)
- Do the following operations over 50 iterations (for 1 <= i <= 50):

  • Remove a single entry, and immediately insert a single entry. Do this for LFR of the total capacity (with a table of 10k entries and LFR=0.1, a single entry would be removed, then a single entry would be inserted, and this, 1000 times)
  • Measure statistics at the i-th iteration

 

The statistics being observed are the mean, the median, the 95th percentile and the variance of the distribution of DIBs for all the entries in the hash table.

The keys for the entries are generated using the random function from the C++ standard library. For each test case, 10 instances have been run. This means that for the “batch” and “ripple” test cases, I have run 10 times 50 iterations, then I have averaged the 10 values corresponding to each iteration. For the “loading” test case, I have also run 10 instances, but there is only one iteration, which is why the x-axis in the analysis is not “Iterations” but “Load factor“. For each instance, the same seed value was used across all test cases, in such a way that the k-th instance of a test case was run using the same random keys as for the other test cases.

I have also run tests on a basic linear probing hashing algorithm, so that I could have a reference point when analyzing Robin Hood hashing. I have used the same test cases and the same random keys, to make the comparison fair.

For both the Robin Hood hashing and the basic linear probing implementations, I have used the 128-bit MurmurHash3 hash function which I truncated to the first 64 bits [3].

Finally, all the test cases were run for two different hash table sizes: 10k and 100k, to see if the size would have an impact.

6. Results

The results are presented in the four figures below, each figure showing the same statistic across the different test cases:

  • Figure 4 is the mean DIB
  • Figure 5 is the median of DIB
  • Figure 6 is the 95th percentile of DIB
  • Figure 7 is the variance of DIB

Each of these figures is holding sub-figures, each for a different test case:

  • (a) is the “batch” test case, with LFM=0.8 and LFR=0.1
  • (b) is the “batch” test case, with LFM=0.8 and LFR=0.8
  • (c) is the “ripple” test case, with LFM=0.8 and LFR=0.1
  • (d) is the “loading” test case

I am also showing the results from the original paper:

  • Figure 8 is a reproduction of Figure OP-6.5 (mean DIB in original paper)

The mean DIBs are adjusted for the Robin Hood graphs, in such a way that the minimum DIB is shifted to zero, and this in order to make a fair comparison with the basic linear probing. Indeed, because the implementation of Robin Hood hashing used here is considering only probes between the minimum and maximum DIBs, the probing of an item never starts at DIB 0 but at the minimum DIB. Therefore shifting the graphs for Robin Hood hashing makes sense.

Update 2013-11-17: Take a look at the plots in the follow-up article: Robin Hood hashing with backward shift deletion outperforms basic linear probing [7].

 

 

robin-hood-hashing-mean

Figure 4: Mean DIB averaged over 10 instances with random keys

 

 

robin-hood-hashing-median

Figure 5: Median of DIB averaged over 10 instances with random keys

 

 

robin-hood-hashing-95th-percentile

Figure 6: 95th percentile of DIB averaged over 10 instances with random keys

 

 

robin-hood-hashing-variance

Figure 7: Variance of DIB averaged over 10 instances with random keys

 

 

robin-hood-hashing-op6.5

Figure 8: Reproduction of Figure OP-6.5

 

 

7. Discussion

The first interesting thing to notice is that almost the same results were obtained for hash tables of size 10k and 100k, for both the linear probing and Robin Hood algorithms. This means that for both algorithms, the size of the hash table does not have an influence on the distribution of DIBs.

I tried reproducing the results from the original paper, especially for the mean DIB. Figure 8 was showing the mean probe in Robin Hood hashing after numerous “replacement” operations. A replacement was described at the beginning of Section OP-6.3 as “a deletion followed by an insertion“. Figure 8 was generated using the insertion algorithm in Figure 2, which corresponds to the algorithm described in Section 2.4.

I believe that I was successful in reproducing the results of the original paper in Figure 8 with the “batch” test case using the parameters LFM=0.8 and LFR=0.8, presented in Figure 4(b) above. There are differences — my averages are slightly higher — which can be explained by various factors: difference in the distribution of the keys, or difference in the hash function used for example. Other possible causes are a sudden lack of caffeine in my blood system when I implemented the function, and cosmic rays.

The original paper was stating that Robin Hood hashing has a very low and almost constant variance, and I was able to reproduce that, but only for the “batch” test case with LFM=LFR. Also, basic linear probing also has a low variance for the same test case, as seen in Figure 7(b).

As for the mean DIB in other test cases, like in Figure 4(a) and 4(c), Robin Hood hashing is doing a lot worse compared to basic linear probing. Indeed, after 50 iterations of deletions and insertions, the mean DIB in linear probing is around 5 and appears to have reached its maximum value, whereas it is around 30 for Robin Hood hashing, and is still growing. The same observation can be made for the median of DIB in Figure 5.

Something interesting maybe is the 95th percentile, presented in Figure 6. As it can be seen in Figure 6(a) and 6(c), the 95th percentile of the DIB is slightly lower for Robin Hood hashing compared to linear probing, at least until the 10th iteration. Then the 95th percentile stabilizes for basic linear probing, but continues to grow without a bound for Robin Hood hashing. This means that under 10 iterations of the operations of insertions and deletions applied, probing in the worst case will be more efficient with Robin Hood hashing than with basic linear probing, but on average, as the mean and median are showing in Figure 4 and 5, one would be better off with basic linear probing.

8. Conclusion

Robin Hood hashing is an interesting reordering scheme, and the results from the original paper looked promising.

It is true that I have not implemented the organ-pipe search or the smart search for entry retrieval, which according to the original paper allow for better performance. However, I did implement the same insertion algorithm, which was enough to compare the behavior of Robin Hood hashing after deletions in the original paper to my implementation.
Moreover, the organ-pipe search, the smart search, and the insertion algorithm require to keep track of either the minimum and maximum DIBs, or the mean DIB. This requires additional memory and processing, which messes up the cache-friendliness.

After implementing and testing Robin Hood hashing over various test cases, I was able to reproduce the results of the original paper for only one of the test cases, and even in that case, basic linear probing was doing better. For all the other cases, basic linear probing was performing either equivalently or better compared to Robin Hood hashing.

I am really not convinced that Robin Hood hashing delivers in terms of mean DIB, as the results are showing that a basic linear probing is performing better. And one has to consider that Robin Hood hashing is also more complex to implement, and that accessing items requires more processing compared to a basic linear probing.

I was able to find a few recent blog articles about Robin Hood hashing [4] [5] [6], with results going in the same direction as the ones presented in this article [5].

My personal opinion based on the results presented here is that Robin Hood hashing is a fun algorithm to play and experiment with, but I would not use it for any in-production system that requires serious performance.

If you think the approach employed in this article is missing something, or if you have an opinion regarding why Robin Hood hashing is behaving the way it does in the results presented above, drop a comment below, I’d be very happy to chat about it!

Update 2013-11-17: This conclusion is criticizing Robin Hood hashing using tombstones on deletion. But when using backward shift deletion, Robin Hood hashing clearly outperforms basic linear probing. Take a look at the follow-up article [7].

9. References

[1] Robin Hood Hashing, Pedro Celis, 1986
[2] https://github.com/goossaert/hashmap
[3] MurmurHash
[4] Robin Hood Hashing, Sebastian Sylvan
[5] More on Robin Hood Hashing, Sebastian Sylvan
[6] Robin Hood Hashing, Paul Khuong
[7] Robin Hood Hashing: backward shift deletion, Emmanuel Goossaert

4 Responses leave one →
  1. Andre Bogus permalink
    March 16, 2014

    I may read the code wrong, but it appears to me that the backshift_hashmap put operation searches for a free slot by probing each field, adding 1 to the index on each iteration.

    Knuth has shown that hashes work best when using a bigger probe delta (e.g. 1 + (h % (size – 2))). I can however see that using a bigger delta would complicate the distance calculation. Has this informed the tradeoff or could both approaches be mixed to gain even higher performance?

    Regards,
    Andre

    • March 23, 2014

      You read the code correctly, the search is performed by iterating linearly over the bucket array, starting from the initial bucket for the key being looked-up. I am not familiar with the exact probing scheme that you mentioned, though I did consider other schemes such as quadratic probing and double hashing. As you surely already know, they both intend to diminish the main drawback of basic linear probing which is clustering, when many keys hash to the same bucket or same bucket neighborhood. However, just like basic linear probing, they suffer from contamination, when many buckets contain tombstones entries after deletions have occurred. The main advantage of Robin Hood hashing with backward deletion (RH) over those approaches is that the variance of the DIB is guaranteed to bounded, and more interestingly, bounded to a low value, which means very few buckets will need to be probed to find an entry.

      It is perfectly possible to mix the reordering done by RH with non-linear probing schemes, though it may end up being harmful. Indeed, The average DIB will remain the same, the only thing that will change is that the probing scheme will no longer be linear and therefore during a lookup, it will force the pointer to jump to non-contiguous areas in the memory, loosing the advantage of caching and locality of reference.

      But bear in mind that this is all speculations. The easiest way to find out is to implement it and get some data :)

      UPDATE: When I said that the DIB will remain the same, I meant the number of buckets to probe. The actual distance from the initial bucket, in terms of difference between array indexes, will obviously increase.

  2. Mark Pritchard permalink
    September 4, 2014

    Hi Emmanuel,

    Great post, thanks for the details. I’m looking at implementing a Robin Hood hash for a high-speed off-heap collection and like it for the cache friendliness. :)

    Just one thing, I think your diagrams of the insertion are wrong, specifically x(0) in bucket 2. Should it be c(2) in all cases?

    Cheers

    Mark

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS