Hopscotch hashing

2013 August 11
by Emmanuel Goossaert

I am currently experimenting with various hash table algorithms, and I stumbled upon an approach called hopscotch hashing. Hopscotch hashing is a reordering scheme that can be used with the open addressing method for collision resolution in hash tables. When using open addressing with only a probing sequence and no reordering, entries are inserted in the first empty buckets found in the sequence. With a reordering scheme, entries already in the table can be moved as new entries are inserted. Hopscotch hashing is interesting because it guarantees a small number of look-ups to find entries. In addition, those look-ups are in contiguous memory areas, which is very cache friendly.

This article makes minor contributions to the original publication. First, a clear explanation of the insertion process is being given, which is independent from the representation of neighborhoods. The original paper was using the bitmap representation to present the algorithm, and I believe that things are simpler without it. Second, a new neighborhood representation is introduced, it is called the “shadow” representation. It derives the relationship between neighborhoods and initial buckets from the hashed keys, and does not require any extra memory. Finally, a C++ implementation is provided, and experimentation with this implementation allows to shed light on some of the limitations of hopscotch hashing that were not expressed in the original paper.

hopscotch-hashing_more_web

1. Overview

Hopscotch hashing was introduced by Herlihy et al., 2008. Two versions of the paper can be found online, a draft [1] and the final version [2]. The C++ source code associated with the paper is also available [3]. The original paper was mostly focusing on multi-core and concurrent environments. Those aspects are not being discussed in the present article, which focuses on single-core environments.

The main idea behind hopscotch hashing is that each bucket has a neighborhood of size H. The neighborhood of a bucket B is defined as the bucket itself and the (H-1) buckets following B contiguously in memory (H buckets total). This also means that at any bucket, multiple neighborhoods are overlapping (H to be exact). Hopscotch hashing guarantees that an entry will always be found in the neighborhood of its initial bucket. As a result, at most H consecutive look-ups will be required in contiguous memory locations, which is extremely cache friendly.

2. Basic operations

2.1 Entry insertion

In order to insert a new entry, its key is hashed to find the initial bucket for the entry, denoted as B. The search for an empty bucket E then starts, using linear probing in the neighborhood of B. If an empty bucket is found, the entry is saved there. Otherwise, the linear probing is pushed beyond the border of the neighborhood of B, until an empty bucket is found. If the entry were saved there, subsequent trials to retrieve it would fail, because the search would be bounded to the neighborhood of B. In order to fix that, the empty bucket E needs to be “moved” inside the neighborhood of B.

In order to move the empty bucket E from the location where it was found into the neighborhood of B, it needs to be swapped with entries between the positions of B and E. An entry will respect the hopscotch guarantee if it can be found in the neighborhood of its initial bucket, i.e., in its initial bucket or in any of the following (H-1) buckets. Consequently, the empty bucket E can be swapped with any of the preceding (H-1) entries that have the position of E in their neighborhoods.

At this point, I believe that long paragraphs of explanations would not be of any help. Instead, I am presenting the insertion process of hopscotch hashing with a diagram, in Figure 1 below. For each step, the array on the left represents the status of the hash table, and the bullet points on the right provide some information as to what the algorithm is doing.

hopscotch-hashing_web

Figure 1

2.2 Entry retrieval

The first step to retrieve an entry is to determine its initial bucket. Then, all what has to be done is to start from the position of this initial bucket and to scan through the next (H-1) buckets, and for each bucket, to compare the key with the key of the entry being searched. If the key does not match any of the keys for the entries in the neighborhood of the initial bucket, then the entry is simply not in the table.

2.3 Entry deletion

The deletion process is simple: replace the entry to delete by an empty entry, which can be null or whatever value was chosen for the implementation at hands. This is a major improvement compared to basic open addressing that uses only probing. In that that case, entries would be removed by lazy deletion, which require the introduction of special values to mark deleted buckets. Lazy deletion leads to the apparition of large areas of buckets marked as deleted, that are unused but yet must be scanned through during look-ups, a phenomenon called contamination.

3. Neighborhood representations

In the original paper, two representations of the neighborhoods were covered [2]. The first one is using one bitmap per bucket, and the second one is using a linked list per bucket. Those two representations are described below. I am also introducing a new representation that I call “shadow”, which derives the relationship between an entry and its neighborhood using the hashed key of that entry.

3.1 Bitmap representation

Neighborhoods can be stored using bitmaps (bit arrays). With this solution, each bucket needs a bitmap in addition to all the data already needed for that bucket. Each bit in that bitmap indicates if the current bucket or one of its following (H-1) buckets are holding an entry which belongs to the neighborhood of the current bucket. In the paper, the value of H = 32 was chosen, so that the bitmap could be stored using a 32-bit unsigned integer. Figure 2 below shows the hash table presented in Figure 1, and along with it the bitmaps for each of the buckets. The size of the neighborhoods is H = 4. In order to understand what those bitmaps are doing exactly, let’s take a look at bucket 5. It has one entry in its neighborhood, stored in bucket 6. The bitmap for bucket 5 is therefore 0100, with a bit set to 1 at index 1, because bucket 6 is at an offset of 1 from bucket 5.

The main drawback of using bitmaps is that the maximum size of the neighborhoods is limited by the number of bits in the bitmaps. The size of the bitmaps could be increased, but that would mean that the size of each bucket would be increased, which at some point would make the whole bucket array explode in memory.

hopscotch-hashing-bitmaps_web

Figure 2

3.2 Linked-list representation

If neighborhoods are stored using linked lists, each bucket has a pointer to the next bucket in the same neighborhood. Since the buckets in a same neighborhood will be found at a bounded distance, at most (H-1), small integers can be used to store offsets instead of larger and more costly addresses, which can save up a lot of memory. In the code associated with the paper [3], this offset solution is the one that was chosen.

An advantage of using linked lists is that the size of the neighborhoods can be much larger without too much overhead on the total size of the bucket array. One of the drawbacks of using linked lists is that it may negate cache friendliness, because it prevents spatial locality and prefetching, as the pointers may jump forward and backward in memory.

3.3 Shadow representation

This representation was not part of the original paper, this is just an idea that I wanted to experiment with. Let’s assume that the hashed keys are stored in the bucket array. Due to the hopscotch method, an entry may not be in the bucket it was hashed to, its initial bucket, but most likely in a bucket in the neighborhood of its initial bucket. From the hashed key only, it is possible to find for any entry the position of its initial bucket using the modulo operator. From there, the neighborhood to which the entry belongs can be determined, which is the initial bucket that was just derived and the next (H-1) buckets. Because the relation between an entry and its neighborhood is being derived from the information already in the table and is not stored, I have named it the “shadow” representation.

Storing the hashed keys is frequently required. For example if the implementation of a hash table needs to allow for resizing, then in some cases performance can be improved by storing the hashed keys. As I am researching collision resolution methods to store hash tables on disk, storing the hashed key of each entry in the bucket array is something I am strongly considering. It is true that storing the hashed keys requires more memory, but it avoids the need for extra memory to store the neighborhoods as it is the case with the bitmap and linked-list representations.

It is important to understand that the relationship between an entry and its neighborhood is reversed in the shadow representation compared to the bitmap and linked-list representations. Indeed, for the bitmap representation, the neighborhood of a bucket is determined using the bitmap stored with that bucket. With the shadow representation, a bucket is accessed and then its initial bucket is determined, which allows to know to which neighborhood this bucket belongs to. This is just a detail, but this will show up in the implementation and therefore it is something to keep in mind.

Finding the initial bucket from a hashed key requires the use of the modulo operator. This can appear to be costly compared to just storing the neighborhoods, but actually if the size of the bucket array is a power of two, then the modulo can be computed using a bitwise AND, which is extremely fast. Indeed, a % 2^n = a AND (2^n – 1).

Finally, the shadow representation allows for larger neighborhood sizes, in theory for neighborhoods as large as the bucket array is itself. And larger neighborhoods can be useful to allow for more entries to be stored in the hash table, and reach higher load factors.

4. Source code

Talk is cheap. Show me the code.
— Linus Torvalds

I have implemented the bitmap and shadow representations of the hopscotch hashing algorithm in C++. The code is available on GitHub [4].

A Makefile is included and should simplify compilation. From there, simply calling the program with the --help parameter gives a full description of the options available: ./hashmap --help

I have only performed limited experiments, and not a thorough statistical study. Nevertheless, with this code I was able to find out more about the practical behavior of hopscotch hashing, and some of the limitations that were not described in the original paper.

5. Discussion

5.1 Criticism of the original paper

In this section I am trying to shed light on a downside from the original paper. This may come across as I am picking on the paper, but I am not, I am just pointing out something I find to be inconsistent.

The code from the paper implements two representations of the hopscotch hashing, the bitmap and the linked list [3]. The implemented bitmap representation follows what is described in the paper, but surprisingly, this seems not to be the case for the linked-list one. Indeed, for the linked-list implementation, it seems that the code of the paper does not have the hopscotch algorithm in the putIfAbsent() method (HopscotchHashMap.h). From what I understand from the code, it appears that putIfAbsent() is just implementing a linear probing and that buckets are then linked together to prevent probing all buckets when calling containsKey(). But no trace of displacements as described in the paper. Some displacements are made in optimize_cacheline_use() which is called from remove(), but nothing more.

I am aware that maybe I just don’t understand this code properly, but I believe I do. Therefore either the version of the code that was released is not the most recent that was used for the paper, or the authors haven’t fully implemented what they described. If the second option is true, this would mean that the experimental results presented in the paper were not obtained using hopscotch hashing, and therefore that the conclusions reached by this paper could not be taken seriously.

This being said, I still find the hopscotch hashing algorithm to be interesting, and it totally deserves to be implemented and tested. If someone has a better understanding of this code than I do, please drop a comment below.

5.2 Clustering

The major limitation of hopscotch hashing is that the load factor does not increase as described in the original paper. In the paper, one of the proofs states: “the probability of having more than H keys in a bucket is 1/H!“. With a neighborhood of size 32, that’s a probability of 1/32!, which is extremely unlikely. This proof may create a misunderstanding, which is that the load factor can increase to very high values because hopscotch hashing will prevent clustering. But this is not the case. This proof only says that the chance of having H entries in a single neighborhood is 1/H!. However, this does not prevent multiple buckets to cluster the overlapping area of their respective neighborhoods. Figure 3 below shows such as case of clustering for hopscotch hashing with neighborhoods of size H = 8. Bucket 0 is the initial bucket, and bucket 11 is the empty bucket found by linear probing.

hopscotch-hashing-clustering_web

Figure 3

In Figure 3, all the buckets in the area of the swap candidates are in the neighborhoods of buckets that are before the beginning of the swap area. Because of that, none of those buckets can be swapped with the empty bucket. This clustering behavior occurs around load factors of 0.8 to 0.9, depending on the keys, the hash function, etc. I have also tested with the source code of the paper using the implementation of the bitmap representation, “bhop” from BitmapHopscotchHashMap.h, and I have observed a similar behavior [3]. Anyway at such high loads, the best thing to do would be to resize the table.

5.3 Larger neighborhood size

A great thing with the shadow representation is that the largest size of the neighborhood is theoretically the size of the hash table itself. Of course having such as large value would be useless, because it would be equivalent to using only linear probing with no reordering. But I wanted to try the hopscotch hashing with neighborhood sizes larger than the value of 32 tested in the original paper.

What it appears is that with neighborhoods of size 64, the load factor at which clustering prevents inserting is around 0.90, and with neighborhoods of size 128, the load factor can be pushed towards around 0.95. Another way to look at it is that increasing the size of the neighborhoods follows the law of diminishing returns. Increasing the size of the neighborhood beyond that will give smaller and smaller improvements, and therefore is not worth it. My feeling is that the sweet spot for the size of the neighborhoods lies somewhere between 32 and 64, and beyond that, resizing the table is the best option.

5.4 Variable neighborhood size

Another idea that I wanted to try was to have variable neighborhood sizes. The idea is simple: when it is impossible to insert an item due to clustering, instead of resizing the table, simply double the size of the neighborhood. Starting at 4, the size of the neighborhood would be doubling until it reaches 128. My intuition was that by starting with smaller neighborhood sizes, items would be more spatially localized, which would allow for higher load factors to be reached than with constant neighborhood sizes. But during my short experiments, it has been clear that using an increasing variable neighborhood size does not have any serious advantage over using a constant neighborhood size, at least for the goal of reaching higher load factors.

6. Conclusion

This was just a short presentation of hopscotch hashing. In spite of the clustering effect that was observed, the guarantee offered by hopscotch hashing of having a bounded number of look-ups in contiguous memory is very compelling. Indeed, my main goal being to find a good method for storing hash tables on disk, hopscotch hashing would help limiting the number of blocks accessed on disk for any operation to the hash table.

This article is still missing an empirical study that would compare hopscotch hashing to other collision resolution methods. Further investigation should also be aimed at comparing constant-neighborhood and variable-neighborhood approaches, to see if they differ in metrics such as the average and maximum probing sequences. As I am planning to implement other reordering schemes, I am keeping this analysis for a future article, when I will be done implementing all the other methods that I want to test.

Remember to take a look at the source code on GitHub [4].

Happy coding! :)

7. References

[1] Submission draft for Hopscotch Hashing by Herlihy et at., 2008
[2] Hopscotch Hashing by Herlihy et at., 2008
[3] Source code from the authors of the original hopscotch hashing paper
[4] https://github.com/goossaert/hashmap

5 Responses leave one →
  1. Vadim permalink
    January 12, 2014

    Hi Emmanuel,
    If I read your ShadowHashMap code correctly, in lookup function you are linearly checking the whole neighborhood for the query key, which would include items that have a different key hash. The advantage of a bitmap or a linked list, as presented in the original paper, is that you only compare to keys of the items that landed in the same original bucket as the query key.

    Also, why do you say that the linked list is cache-unfriendly? If you enforced the same max distance from the original bucket as for the bitmap, i.e. 32, I’d expect it to perform just as well.

    • January 13, 2014

      You are right, the code for the Get() method in ShadowHashMap is checking every single bucket in the neighborhood. I am storing the hash of the key in each bucket, and my intent was to compare that stored hash with the hash of the key being looked up to avoid unnecessary accesses to the entry. I forgot to add the condition for that in the if statement, so that was definitely a bug. I just committed a fix, thanks for pointing it out! :)

      As for the linked-list neighborhood, I was referring to cache prefetching more specifically. Assuming deletions have occurred in the table, if you have two buckets X and Y in the same linked list with a link from X to Y, there is no guarantee that Y will be stored at an address higher than X, and indeed, Y could be at an address lower than X. If the whole linked list for a neighbordhood has links pointing randomly to higher or lower memory addresses, then going through this list would prevent the cache prefetching mechanism to work properly, as the memory wouldn’t be accessed sequentially. Although, I’ll admit it depends on the prefetcher (stream, stride, adjacent cache line, etc.)

      But in the end, my feeling is that with a densely filled hopscotch-hashing table, the exact neighborhood representation would not matter. Even by scanning only 1 out of 5 or 6 buckets in a neighborhood, the number of cache lines that would be loaded in the L1 cache would be roughly the same for all neighborhood representations, and there would be little difference in performance between them (assuming 64-byte L1 cache lines).

  2. Haravikk permalink
    June 23, 2014

    Thanks a bunch for posting this, it’s really helped me to understand how hop-scotch hashing works! I’ve been following along by writing an implementation in Java and it has pretty impressive performance!

    Like you I opted for your shadow method; in practice it doesn’t seem to really be any slower, and actually I didn’t find much benefit from storing hash values to avoid processing them, though I suppose that’s heavily dependent on your chosen hash function, though my focus is also on memory efficiency so I didn’t want the overhead anyway. I’m also using variable bucket sizes, increasing their size only when I can’t find anything to swap, and resetting whenever I resize the array; I was originally going to resize the array if the bucket size became too large, but in all my tests so far I haven’t been able to get the bucket size to automatically go higher than 64, I’m sure it could but it seems very unlikely to happen except with a really bad hashing function or unusual set of keys.

    Compared against Java’s HashMap implementation, which is a more traditional array of linked-list buckets, I notice the hop-scotch array has essentially opposite performance; Java’s hash map performs lookups of existing keys in a very large set (200,000 entries) in around 200ns on my machine, and non-existent ones in about 40ns, while the shadow hop-scotch map does it the other way round which seems preferable for cases where you are looking for keys you expect to find. This slower miss behaviour would be helped a bit by the bitfield or linked-list variations, but with a good hash function I don’t think there’s much in it, especially considering the lower overhead.

    My choice for hash function was the following:
    i ^ (~i << 11) ^ (i << 3) ^ (~i << 27)

    Its general purpose properties aren't great, but for hashing to get an index it produces unique indices around 69% of the time, which is the best I've seen; besides generating a perfect hash function for your set of course.

    • Haravikk permalink
      June 23, 2014

      That first set of brackets should be a left shift, not sure what happened there.

      Also, I forgot to mention that I did discover a tweak that can be useful in reducing the slower (though still fast) performance when searching for a non-existent key.

      It requires a distinction between entries that were initialised as null, and ones that were deleted, I’ve essentially done this by creating a shared DELETED entry that can be assigned instead of setting an entry to null when it is removed. Searching for an empty bucket will then search for a bucket that is either null or DELETED (doesn’t matter which it finds), and when performing a lookup, we can stop early if we encounter a null entry, as we know the neighbourhood doesn’t extend beyond this point.

      Once again this is still somewhat dependent upon the performance of the hash function, otherwise you won’t have many null entries, or they’ll be clustered somewhere. However if the map reaches a point where removals and insertions are roughly balanced, then insertions will occur within deleted entries rather than null ones, while the nulls kind of naturally indicate the end of neighbourhood clusters. Theoretically the nulls may disappear over time if the number of entries in each neighbourhood keeps expanding and contracting, but again, in a well distributed set that doesn’t seem to happen easily.

      • June 25, 2014

        Thanks for sharing your results!

        I fixed the shift in your hash function.

        Regarding storing the hashed keys in the bucket array, the main advantage is for when the data is stored in secondary storage (HDD, SSD, etc.). Indeed, when looking up a key, this allows to quickly compare its hash with the ones in the bucket array, and only retrieve data from the secondary storage when the hash values are matching. As the table fills up, this prevents the lookup method from doing many random reads on the secondary storage, which are costly.

        And nice find about using a special DELETED entry. However in a table where many insertions and deletions occur, after some time I would expect the DELETED entry to be pushed to the border of the neighborhood, loosing its initial advantage.

        You probably have already seen it, but just in case, here is a link to another article in which I gathered more data for open-addressing hash tables, including hopscotch hashing: http://codecapsule.com/2014/05/07/implementing-a-key-value-store-part-6-open-addressing-hash-tables/

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS