Following deletion of ‘x’ from bucket 3, a linear search (called the first search) begins at bucket 4 in order to find a bucket whose initial bucket is less than or equal to 3. The first search is confined to the neighborhood of bucket 3 and hence will terminate at or before bucket 6, given that the neighborhood size H equals 4. Also, the first search will terminate prior to bucket 6 if it finds either an empty bucket or a bucket whose initial bucket is less than or equal to 3.

The first search rejects bucket 4 whose initial bucket is 4, then finds bucket 5 whose initial bucket is 3, so ‘r’ is moved from bucket 5 to bucket 3. Then a second search begins at bucket 6 in order to find a bucket whose initial bucket is less than or equal to 5. The second search is confined to the neighborhood of bucket 5 and hence will terminate at or before bucket 8.

The second search finds bucket 6 whose initial bucket is 3, so ‘c’ is moved from bucket 6 to bucket 5. Then a third search begins at bucket 7 in order to find a bucket whose initial bucket is less than or equal to 6. The third search is confined to the neighborhood of bucket 6 and hence will terminate at or before bucket 9.

The third search rejects bucket 7 whose initial bucket is 7, then finds bucket 8 whose initial bucket is 5, so ‘s’ is moved from bucket 8 to bucket 6. Then a fourth search begins at bucket 9 in order to find a bucket whose initial bucket is less than or equal to 8. The fourth search is confined to the neighborhood of bucket 8 and hence will terminate at or before bucket 11.

The fourth search finds empty bucket 9 and terminates. Because bucket 9 is empty, the “backward shift deletion” algorithm also terminates and no more searches are performed. At this point, the state of the hash table is as shown prior to Step 1 in Figure 3. The “backward shift deletion” algorithm has reverted the Hopscotch hash table to its state prior to insertion of ‘x’.
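The four searches above collapse into a single loop. Here is a minimal sketch, assuming a table that records each entry's initial (home) bucket alongside its value; the class and field names are illustrative, not taken from any real implementation:

```java
// Minimal sketch of the backward shift deletion loop described above.
// Assumes each occupied bucket records its "initial" (home) bucket;
// all names here are illustrative.
class Table {
    static final int H = 4;        // neighborhood size, as in the example
    final Object[] value;
    final int[] initial;           // home bucket of each entry, -1 if empty

    Table(int capacity) {
        value = new Object[capacity];
        initial = new int[capacity];
        java.util.Arrays.fill(initial, -1);
    }

    void backwardShiftDelete(int hole) {
        value[hole] = null;
        initial[hole] = -1;
        // Each search is confined to the neighborhood of the current hole,
        // i.e. buckets hole+1 .. hole+H-1.
        for (int j = hole + 1; j < value.length && j < hole + H; j++) {
            if (value[j] == null) return;        // empty bucket: terminate
            if (initial[j] <= hole) {            // movable entry found
                value[hole] = value[j];          // shift it back into the hole
                initial[hole] = initial[j];
                value[j] = null;
                initial[j] = -1;
                hole = j;                        // the hole moves forward; the
            }                                    // next search uses j's neighborhood
        }
        // Loop exit without a movable entry: the algorithm terminates.
    }
}
```

Running this on the example state (‘x’ in bucket 3, ‘r’ in 5, ‘c’ in 6, ‘s’ in 8) reproduces the four searches: ‘r’ moves to 3, ‘c’ to 5, ‘s’ to 6, and the loop stops at empty bucket 9.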

It is important to distinguish between termination of a search and termination of the “backward shift deletion” algorithm. A search can fail either because it finds an empty bucket or because it fails to find a bucket having the required initial bucket. In either case, no new search is initiated, so the “backward shift deletion” algorithm terminates.

On the other hand, a successful search will always initiate another search. In order to avoid an infinite number of searches, the “backward shift deletion” algorithm should be terminated after it has inspected a predetermined number of buckets. How many buckets to inspect prior to termination is an open question.

Note that the Hopscotch insertion algorithm discussed in Section 2.1 above performs a linear search to find an empty bucket. If no empty bucket is found, the insertion algorithm is terminated automatically after it has inspected a predetermined number of buckets. Perhaps the “backward shift deletion” algorithm could be terminated automatically after it has inspected this same predetermined number of buckets.

(Typo: “In that that case” → “In that case”)

I fixed the shift in your hash function.

Regarding storing the hashed keys in the bucket array, the main advantage arises when the data is stored in secondary storage (HDD, SSD, etc.). When looking up a key, this allows the lookup to quickly compare the key’s hash with the ones cached in the bucket array, and to retrieve data from secondary storage only when the hash values match. As the table fills up, this prevents the lookup method from performing many random reads on the secondary storage, which are costly.
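A minimal sketch of that hash-first comparison, with a hypothetical secondary-storage interface (none of these names come from an actual library):

```java
// Sketch of the hash-first lookup described above: cached hashes live in
// the in-memory bucket array, and the costly random read against secondary
// storage happens only when a cached hash matches the query hash.
// The SecondaryStore interface and all names here are hypothetical.
final class HashFirstLookup {
    interface SecondaryStore { byte[] read(long offset); }  // hypothetical storage API

    static final class Bucket {
        long keyHash;   // cached hash kept in the in-memory bucket array
        long offset;    // location of the full entry in secondary storage
    }

    static byte[] lookup(Bucket[] neighborhood, long queryHash,
                         SecondaryStore store,
                         java.util.function.Predicate<byte[]> keyMatches) {
        for (Bucket b : neighborhood) {
            if (b == null) continue;
            if (b.keyHash != queryHash) continue;  // cheap in-memory rejection
            byte[] entry = store.read(b.offset);   // costly random read, only on hash match
            if (keyMatches.test(entry)) return entry;
        }
        return null;
    }
}
```

The point is in the second `continue`: buckets whose cached hash differs are rejected without touching the disk at all.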

And nice find about using a special DELETED entry. However, in a table where many insertions and deletions occur, after some time I would expect the DELETED entry to be pushed to the border of the neighborhood, losing its initial advantage.

You probably have already seen it, but just in case, here is a link to another article in which I gathered more data for open-addressing hash tables, including hopscotch hashing: http://codecapsule.com/2014/05/07/implementing-a-key-value-store-part-6-open-addressing-hash-tables/

Also, I forgot to mention that I did discover a tweak that can be useful in reducing the slower (though still fast) performance when searching for a non-existent key.

It requires a distinction between entries that were initialised as null and ones that were deleted. I’ve essentially done this by creating a shared DELETED entry that is assigned in place of null when an entry is removed. Searching for an empty bucket then looks for a bucket that is either null or DELETED (it doesn’t matter which it finds), and when performing a lookup, we can stop early if we encounter a null entry, as we know the neighbourhood doesn’t extend beyond this point.
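That null/DELETED distinction can be sketched as follows; the names are illustrative, and a plain linear neighbourhood scan is assumed:

```java
// Sketch of the tombstone tweak described above (illustrative names).
// null    = never used: the neighbourhood cannot extend past it, so a
//           lookup may stop early.
// DELETED = tombstone: reusable by insertion, but lookup keeps scanning.
final class SentinelTable {
    static final Object DELETED = new Object();   // shared tombstone
    final Object[] keys;

    SentinelTable(int capacity) { keys = new Object[capacity]; }

    boolean contains(Object key, int home, int h) {
        for (int i = home; i < keys.length && i < home + h; i++) {
            if (keys[i] == null) return false;    // early exit: end of neighbourhood
            if (keys[i] == DELETED) continue;     // deleted slot: keep scanning
            if (keys[i].equals(key)) return true;
        }
        return false;
    }

    void remove(int slot) { keys[slot] = DELETED; }
}
```

A miss for a key whose neighbourhood ends in a null entry now returns after inspecting only the occupied prefix, rather than the full H buckets.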

Once again this is still somewhat dependent on the performance of the hash function; otherwise you won’t have many null entries, or they’ll be clustered somewhere. However, if the map reaches a point where removals and insertions are roughly balanced, then insertions will occur within deleted entries rather than null ones, while the nulls naturally come to indicate the end of neighbourhood clusters. Theoretically the nulls may disappear over time if the number of entries in each neighbourhood keeps expanding and contracting, but again, with a well-distributed set that doesn’t seem to happen easily.

Like you I opted for your shadow method; in practice it doesn’t seem to be any slower. I also didn’t find much benefit from storing hash values to avoid recomputing them, though I suppose that’s heavily dependent on your chosen hash function; my focus is also on memory efficiency, so I didn’t want the overhead anyway. I’m also using variable bucket sizes, increasing their size only when I can’t find anything to swap, and resetting whenever I resize the array. I was originally going to resize the array if the bucket size became too large, but in all my tests so far I haven’t been able to get the bucket size to grow beyond 64 on its own; I’m sure it could, but it seems very unlikely except with a really bad hash function or an unusual set of keys.

Compared against Java’s HashMap implementation, which is a more traditional array of linked-list buckets, I notice the hopscotch array has essentially the opposite performance profile: Java’s hash map performs lookups of existing keys in a very large set (200,000 entries) in around 200ns on my machine, and of non-existent ones in about 40ns, while the shadow hopscotch map does it the other way round, which seems preferable for cases where you expect to find the keys you look up. This slower miss behaviour would be helped a bit by the bitfield or linked-list variations, but with a good hash function I don’t think there’s much in it, especially considering the lower overhead.

My choice for hash function was the following:

i ^ (~i << 11) ^ (i << 3) ^ (~i << 27)

Its general-purpose properties aren’t great, but for hashing to get an index it produces unique indices around 69% of the time, which is the best I’ve seen, short of generating a perfect hash function for your set of course.
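Wrapped as a Java helper, with a power-of-two masking step added as an assumption to turn the hash into an index (the mixing expression itself is unchanged from the comment above):

```java
// The commenter's mixing function wrapped as a helper. The index()
// masking step is an added assumption (power-of-two capacity), not
// part of the original expression.
final class Mix {
    static int hash(int i) {
        return i ^ (~i << 11) ^ (i << 3) ^ (~i << 27);
    }

    static int index(int i, int powerOfTwoCapacity) {
        // The mask also clears the sign bit, so the result is a valid index.
        return hash(i) & (powerOfTwoCapacity - 1);
    }
}
```

Note that in Java, `<<` on an `int` simply discards the bits shifted out, so the three shifted terms scatter different bit ranges of `i` (and of `~i`) across the word before the XORs combine them.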

As for the linked-list neighborhood, I was referring to cache prefetching more specifically. Assuming deletions have occurred in the table, if you have two buckets X and Y in the same linked list with a link from X to Y, there is no guarantee that Y will be stored at an address higher than X; indeed, Y could be at an address lower than X. If the whole linked list for a neighborhood has links pointing randomly to higher or lower memory addresses, then walking this list would prevent the cache prefetching mechanism from working properly, as memory wouldn’t be accessed sequentially. That said, I’ll admit it depends on the prefetcher (stream, stride, adjacent cache line, etc.).

But in the end, my feeling is that with a densely filled hopscotch-hashing table, the exact neighborhood representation would not matter. Even by scanning only 1 out of 5 or 6 buckets in a neighborhood, the number of cache lines that would be loaded in the L1 cache would be roughly the same for all neighborhood representations, and there would be little difference in performance between them (assuming 64-byte L1 cache lines).

If I read your ShadowHashMap code correctly, in the lookup function you are linearly checking the whole neighborhood for the query key, which includes items that have a different key hash. The advantage of a bitmap or a linked list, as presented in the original paper, is that you only compare against keys of the items that landed in the same original bucket as the query key.

Also, why do you say that the linked list is cache-unfriendly? If you enforced the same max distance from the original bucket as for the bitmap, i.e. 32, I’d expect it to perform just as well.
