As part of my work on my key-value store project, I am currently researching hashing methods with the goal of finding one that fits the performance constraints of on-disk storage. In this article, I give a quick review of cuckoo hashing, a method to resolve collisions in hash tables. This article is not part of the IKVS series, as it is not specific to key-value stores.
1. Overview
Cuckoo hashing [1] is a method to resolve collisions in hash tables. It was first introduced by Pagh and Rodler [2]. The basic version of cuckoo hashing uses two hash functions, hash1() and hash2(), associated with two separate tables, T1 and T2. Cuckoo hashing guarantees that an entry with key x and value a, denoted as <x,a>, will be found either in the bucket at index hash1(x) of table T1, or in the bucket at index hash2(x) of table T2. This means that at most two memory locations need to be tested in order to find an entry, an improvement over the possibly dozens of tested locations with basic linear probing. But this comes at a cost: in order to keep lookups fast, the insertion process rearranges entries, which can be very costly.
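The two-location lookup can be sketched in a few lines of Python. This is a minimal illustration, not production code: the table size, the entry representation as (key, value) tuples, and the way the second hash function is derived are all assumptions made for the example.

```python
# Minimal sketch of a cuckoo hash lookup: an entry <x,a> can only live
# in T1[hash1(x)] or T2[hash2(x)], so at most two probes are needed.
# The table size and hash functions below are illustrative assumptions.

TABLE_SIZE = 8

def hash1(key):
    return hash(key) % TABLE_SIZE

def hash2(key):
    # A second hash function; derived from hash() here for brevity,
    # a real implementation would use an independent function.
    return (hash(key) // TABLE_SIZE) % TABLE_SIZE

T1 = [None] * TABLE_SIZE  # each slot holds a (key, value) tuple or None
T2 = [None] * TABLE_SIZE

def lookup(key):
    """Probe the only two buckets where the key can be stored."""
    entry = T1[hash1(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    entry = T2[hash2(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None  # two probes at most, never more
```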
2. Entry insertion
In order to insert a new entry <x,a>, its two possible buckets are tested first, i.e. hash1(x) in T1 and hash2(x) in T2. If one of those buckets is available, the entry is stored there. This is what is shown in Figure 1 below.
Figure 1
In case both buckets already contain entries, a little data shuffling becomes necessary. This case is presented in Figure 2. One of the hash functions and its table are chosen, for instance hash1() and table T1. Let’s denote as <y,b> the entry that is colliding with <x,a> in T1, that is to say the entry for which hash1(x) = hash1(y). The insertion proceeds by withdrawing entry <y,b> from its bucket, and by storing the new entry <x,a> at that location. The insertion process then continues with <y,b>, and since it was just withdrawn from its bucket in T1, it cannot be stored there. The only possibility for <y,b> is its alternate bucket in T2, at index hash2(y). If that bucket is empty, <y,b> is stored there and the insertion process terminates. If there is an entry already in that bucket, that entry has to be moved in turn. The insertion continues until the last withdrawn entry is stored in an empty bucket. But there is a possibility for the displacement process to enter a cycle, and for this reason, the number of displacements must be bounded by a constant. When the number of displacements reaches that constant, the tables are resized.
Figure 2
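The displacement loop above can be sketched as follows. This is only an illustration of the algorithm under stated assumptions: the displacement bound of 16, the table size, and the hash functions are all arbitrary choices, and the resize-and-rehash step that a real implementation would perform is stubbed out.

```python
# Sketch of cuckoo insertion with displacements, bounded by a constant.
# Hash functions, table size, and the bound are illustrative assumptions;
# the resize that should happen when the bound is hit is omitted.

class CuckooTable:
    MAX_DISPLACEMENTS = 16

    def __init__(self, size=8):
        self.size = size
        self.T1 = [None] * size
        self.T2 = [None] * size

    def hash1(self, key):
        return hash(key) % self.size

    def hash2(self, key):
        # Illustrative second hash function derived from hash().
        return (hash(key) // self.size) % self.size

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(self.MAX_DISPLACEMENTS):
            # Store the entry in its T1 bucket, withdrawing whatever
            # entry was there before.
            i = self.hash1(entry[0])
            self.T1[i], entry = entry, self.T1[i]
            if entry is None:
                return  # bucket was empty, insertion terminates
            # The withdrawn entry cannot go back to T1: its only other
            # possible location is its alternate bucket in T2.
            j = self.hash2(entry[0])
            self.T2[j], entry = entry, self.T2[j]
            if entry is None:
                return
        # Possible cycle: a real implementation resizes and rehashes here.
        raise RuntimeError("displacement bound reached: tables must be resized")
```

Note that the swap-then-check structure handles both the easy case (an empty bucket is found immediately) and the displacement chain with the same code path.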
3. Generalization
The example presented above had two hash functions and one entry per bucket. It is possible to generalize cuckoo hashing to more hash functions and more entries per bucket [2]. More hash functions means more lookups but less collisions, and more entries per bucket means less displacements and better cache utilization.
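The generalized lookup can be sketched as below. The parameters K (hash functions), B (entries per bucket), and the family of hash functions are assumptions made for the example; displacements on insertion are left out, so this insert simply fails when all K candidate buckets are full.

```python
# Sketch of generalized cuckoo hashing: K hash functions, one table each,
# and buckets holding up to B entries. K, B, SIZE, and the hash family
# are illustrative assumptions.

K = 3      # number of hash functions / tables
B = 4      # entries per bucket
SIZE = 8   # buckets per table

def hash_k(key, k):
    # Illustrative family of hash functions, one per table.
    return hash((k, key)) % SIZE

tables = [[[] for _ in range(SIZE)] for _ in range(K)]

def insert(key, value):
    # Simplified: take the first candidate bucket with free space.
    # A full implementation would displace entries as in the 2-table case.
    for k in range(K):
        bucket = tables[k][hash_k(key, k)]
        if len(bucket) < B:
            bucket.append((key, value))
            return True
    return False  # all K buckets full: displacements or a resize needed

def lookup(key):
    # At most K bucket probes; each probe scans up to B entries that sit
    # next to each other, which is what improves cache utilization.
    for k in range(K):
        for stored_key, value in tables[k][hash_k(key, k)]:
            if stored_key == key:
                return value
    return None
```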
4. Discussion
As said earlier, what is great about cuckoo hashing is that searching for an entry requires a deterministic number of lookups, k lookups to be exact, where k is the number of hash function and table pairs being used. The downside is that as the load factor increases, so does the average length of the chains of displacements. This slows down insertions considerably, so much so that it is advised to keep the load factor below 0.5 in the case of cuckoo hashing with two hash functions. Increasing the number of tables and the number of entries per bucket can minimize this drawback, as seen in [3].
Another downside is that even though the number of lookups is limited, those lookups are always done in different tables, which means random memory accesses. With basic open addressing, the main advantage is that entries are found in nearby memory locations, which is cache-friendly. This advantage is completely lost with cuckoo hashing, which trades cache friendliness for fewer lookups. In the case of an in-memory data structure, those random accesses occur in RAM, which is costly but, depending on the application, may not be so dramatic. On the contrary, for data structures stored on disk, the lookups in the different tables of cuckoo hashing translate into random disk accesses, which hurt performance badly. Because I am mainly interested in data structures that perform well when stored on disk, I have chosen not to push the research further and not to implement cuckoo hashing, even for testing purposes.
There has been a lot of research around cuckoo hashing and possible variations, including concurrent and lock-free versions. One such example can be found in [3], which shows promising results for in-memory dictionaries. Another interesting discussion around cuckoo hashing can be found in [4].
5. References
[1] Cuckoo hashing on Wikipedia
[2] Cuckoo hashing, Pagh and Rodler, 2001
[3] Optimistic Cuckoo Hashing for concurrent, read-intensive applications
[4] Some Open Questions Related to Cuckoo Hashing, Michael Mitzenmacher, 2009