7 Responses
  1. Simon permalink
    October 18, 2014

    I enjoyed this blog post. Here are some comments:

    On mmap(): Earlier this year I wrote a key value store called SharedHashFile [1] which only uses mmap() and it is very fast and can achieve over 10 million reads per second on a single box with 100 million short keys. I don’t think it would be possible to achieve this kind of performance using system calls to perform I/O. Why? Because of the non-zero-copy nature of using the system calls. However, I agree with you that write speeds can become a problem because the OS decides when to write the memory mapped page to backing store, and the user program has no control over this AFAIK. This seems to be okay for read heavy key value store functionality.
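    To illustrate the zero-copy idea (a toy sketch, not SharedHashFile’s actual code): once a file is mapped, a lookup is a plain memory access into the mapping, with no read() system call and no copy into a user buffer. The file path and helper names below are made up for the example.

```cpp
// Toy sketch of mmap()-backed reads. After map_file(), a lookup is a
// plain memory access into the mapping: no read() syscall, no copy.
// All names here are illustrative, not SharedHashFile's.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Map a whole file read-only; returns nullptr on failure.
inline const char* map_file(const std::string& path, size_t* size_out) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after close()
    if (addr == MAP_FAILED) return nullptr;
    *size_out = static_cast<size_t>(st.st_size);
    return static_cast<const char*>(addr);
}

// Self-check: write a small file, map it, read the bytes back.
inline std::string demo_roundtrip() {
    const char* path = "/tmp/mmap_demo.bin";
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return "";
    if (write(fd, "hello", 5) != 5) { close(fd); return ""; }
    close(fd);
    size_t size = 0;
    const char* bytes = map_file(path, &size);
    return bytes ? std::string(bytes, size) : "";
}
```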

    Another problem with memory maps is corruption. If the interface to the hash table is code and the memory maps exist inside the process of the hash table caller, then the potentially buggy code of the caller can simply corrupt the hash table memory maps, causing problems. An example of a key value store which has tried to work around this — but with enormous performance trade-offs — is lightningdb [2]. Here the key value store uses read-only memory maps at run-time for fast access, so the caller process cannot corrupt the memory maps. However, writing to the key value store is very slow in comparison and also cannot be done asynchronously.

    On cache lines: Even using memory mapped files — and assuming everything is in memory — I have found that the performance is largely dictated by the number of cache line fetches. Therefore it is important to have an algorithm which reduces these to a minimum. SharedHashFile manages to reduce most key accesses to three cache line fetches.
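    To make the cache line point concrete, here is a toy bucket layout (illustrative only, not SharedHashFile’s real structures) packed so that a probe touches exactly one 64-byte line:

```cpp
// Toy sketch: a hash bucket packed into a single 64-byte cache line,
// so probing one bucket costs one cache line fetch. The field choices
// are illustrative, not SharedHashFile's actual layout.
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// 8 slots of (tag, offset) pairs fit exactly in one cache line:
// 8 * (4 + 4) = 64 bytes.
struct alignas(kCacheLine) Bucket {
    uint32_t tags[8];     // small hash fragments for fast rejection
    uint32_t offsets[8];  // where the full entry lives in the data file
};

static_assert(sizeof(Bucket) == kCacheLine, "bucket must fill one cache line");

// Probe a bucket: compares up to 8 tags, all within the same cache line.
inline int find_slot(const Bucket& b, uint32_t tag) {
    for (int i = 0; i < 8; ++i)
        if (b.tags[i] == tag) return i;
    return -1;
}
```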

    On locking: Probably the other biggest thing influencing performance is lock contention. So SharedHashFile uses a fair read-write spin lock. This relies on atomic instructions, which are slower than regular instructions. I believe that to achieve higher performance it will be necessary to reduce the number of atomic instructions, and this can probably only be done at the algorithm level?
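    As a toy illustration of fairness built from atomics (this is a plain ticket spinlock, much simpler than SharedHashFile’s fair read-write lock): every acquire costs one atomic fetch-and-add, which is exactly the overhead under discussion.

```cpp
// Toy fair lock: a ticket spinlock built from two atomic counters.
// Simpler than SharedHashFile's fair read-write lock, but it shows
// the cost being discussed: every acquire is an atomic fetch_add.
#include <atomic>
#include <cstdint>
#include <thread>

struct TicketLock {
    std::atomic<uint32_t> next{0};     // next ticket to hand out
    std::atomic<uint32_t> serving{0};  // ticket currently allowed in

    void lock() {
        uint32_t my = next.fetch_add(1, std::memory_order_relaxed);
        while (serving.load(std::memory_order_acquire) != my) {
            // spin: threads acquire in ticket order, hence "fair"
        }
    }
    void unlock() { serving.fetch_add(1, std::memory_order_release); }
};

// Self-check: two threads increment a shared counter under the lock.
inline int demo_count() {
    TicketLock lk;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < 1000; ++i) { lk.lock(); ++counter; lk.unlock(); }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```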

    On zero copy: I think this is very difficult or impossible to achieve. Why? When the hash table grows, it’s going to be necessary to internally relocate keys and values, otherwise one will end up with memory holes due to other keys and values being deleted. For this reason alone, a key look-up has to return a copy of a value instead of a pointer to the bytes in memory. A secondary reason is that an asynchronous write to the same key may want to update the value just after it has been read. Things get very complicated for the caller if it is suddenly responsible for locking access to the value itself.
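    A minimal sketch of the copy-out API this argues for (names are illustrative, and an in-memory map stands in for the real store): the caller receives a private copy, so internal relocation or a concurrent writer can never invalidate what it holds.

```cpp
// Toy sketch of a copy-out lookup API. The store returns a copy of
// the value rather than a pointer into its own memory, so internal
// relocation (compaction, resizing) and concurrent writers can never
// invalidate what the caller holds. Names are illustrative.
#include <string>
#include <unordered_map>

class CopyOutStore {
public:
    void put(const std::string& key, const std::string& value) {
        data_[key] = value;
    }
    // Copies the value into 'out' and returns true, or returns false
    // if the key is absent. The caller never sees internal storage,
    // so the store is free to move bytes around later.
    bool get(const std::string& key, std::string* out) const {
        auto it = data_.find(key);
        if (it == data_.end()) return false;
        *out = it->second;  // the copy is made here, under the store's control
        return true;
    }
private:
    std::unordered_map<std::string, std::string> data_;
};
```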

    What else to think about? How the key value store grows. For the next generation of SharedHashFile I am thinking about how to reduce the initial, smallest size of the key value store. Why? I would like to implement a hierarchical key value store. There are two obvious ways to do this: (1) By building the hierarchy on top of the existing key value store using the existing keys. This has the disadvantage that the keys in a folder with e.g. 3 keys are just regular keys, and could therefore be distributed anywhere within the n keys of the complete set of all keys. A secondary disadvantage is that to delete a complete folder, one would have to iterate over all its keys and delete them individually. (2) By creating each folder in the hierarchy as a new instance of the key value store. In the example of a folder with 3 keys, this means the keys would be stored close to each other in memory for better access. And to delete the folder, it would not be necessary to iterate over all keys, especially if the entire folder key value store is stored in the values of keys in the parent folder… just delete those keys. This second scheme would not work very well with SharedHashFile today because the smallest overhead for a key value store instance is substantial, meaning that memory would quickly be exhausted when creating a hierarchical key value store with many folders.

    I know you’ve done a lot of research around key value store technology and I’d love to hear your thoughts on these topics. And I will follow the progress of KingDB with interest. Good luck!


    [1] https://github.com/simonhf/sharedhashfile
    [2] http://symas.com/mdb/

    • October 23, 2014

      Wow, that’s an awesome comment! Thank you for sharing your results on SharedHashFile, I took a look at the source code, that’s a cool project 🙂

      In the case of SharedHashFile, you’re definitely right when you say that minimizing cache line access is critical. Also, as the size of the table increases, the average number of cache lines accessed on each read operation will increase too, and my expectation is that performance will take a blow every time the average distance to initial bucket gets larger than the size of one of the CPU caches.

      And I agree with you when you say that writing to memory maps is dangerous. The problem is the same with regular writes, and will always happen whenever data is overwritten. Most database systems cope with that by never overwriting anything and by first persisting the incoming changes to a new location on the drive, then updating the actual data structure. Write-ahead logging and ARIES are examples of such techniques. KingDB uses a log-structured approach to cope with the problem.
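      A sketch of the log-structured idea (a toy in-memory stand-in, not KingDB’s actual format): updates are only ever appended to the end of a log, never written in place, and an index remembers where the latest record for each key landed.

```cpp
// Toy sketch of log-structured storage: incoming updates are appended
// to the end of a log, never overwritten in place, and an in-memory
// index maps each key to the offset of its latest record. A string
// stands in for the on-disk log; the format is made up.
#include <map>
#include <string>

class AppendLog {
public:
    // Append "key=value\n" and record where it landed; an update to
    // an existing key just appends again and the index moves on.
    void put(const std::string& key, const std::string& value) {
        long offset = static_cast<long>(log_.size());
        log_ += key + "=" + value + "\n";
        index_[key] = offset;  // the newest record wins
    }
    bool get(const std::string& key, std::string* out) const {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        // Re-read the value from the log at the recorded offset.
        size_t start = log_.find('=', static_cast<size_t>(it->second)) + 1;
        size_t end = log_.find('\n', start);
        *out = log_.substr(start, end - start);
        return true;
    }
private:
    std::string log_;                    // stands in for the on-disk log
    std::map<std::string, long> index_;  // key -> offset of latest record
};
```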

      The code of KingDB is still in progress, but so far I have decided not to implement the hash table myself, and simply to use an STL multimap. That multimap associates one or more locations on secondary storage with any hash of a key. Thus if the key value store holds entries of the form &lt;key, value&gt;, the multimap will hold &lt;hashed key, location&gt;. The bottleneck for me is going to be in the file I/O, thus not optimizing for cache line accesses is fine. Also, incoming updates are persisted on the drive as soon as they arrive, in a log-structured way. It’s not as efficient as memory mapping everything, but it keeps the code simple and limits the amount of data that needs to be moved around. More on that in the next blog articles! 🙂
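      The multimap index described above might look roughly like this (the types are guesses for illustration, not KingDB’s actual definitions); a multimap is needed because hash collisions can leave several candidate locations for one hashed key:

```cpp
// Toy sketch of an STL multimap index from hashed keys to locations
// on secondary storage. Hash collisions mean one hashed key can map
// to several locations, hence multimap rather than map. The types
// are guesses for illustration, not KingDB's actual definitions.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct Location {
    uint32_t file_id;  // which log file on disk
    uint32_t offset;   // byte offset of the entry inside it
};

using Index = std::multimap<uint64_t, Location>;

// All candidate locations for a key: every colliding entry must be
// checked against the full key stored on disk.
inline std::vector<Location> candidates(const Index& index, const std::string& key) {
    uint64_t h = std::hash<std::string>{}(key);
    std::vector<Location> out;
    auto range = index.equal_range(h);
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return out;
}
```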

      Finally, your current results of 10M reads per second with SharedHashFile are very impressive! However, you mentioned that you were using short-length keys, thus except for the initial load of each file system page from secondary storage, chances are that the entire working set is read from memory. I’d be curious to know more about this test: what size are your entries, and what is the performance with larger entries? Also, what is the load factor of your hash table at the moment the reads are performed?

      • Simon permalink
        November 10, 2014

        Sorry I did not reply earlier… is it intended that the blog would notify me of your reply? Because it did not 🙁

        Regarding the comment “as the size of the table increases, the average number of cache lines accessed on each read operation will increase too”: This is exactly the thing to avoid, and it’s also what SharedHashFile is designed to avoid; the average distance to any key and value should remain constant no matter how big the key value store becomes. Does KingDB also guarantee this? And if so, how?
        I disagree with you that cache line optimization should be discarded if there is a bigger bottleneck, e.g. file I/O. Why? Let’s say the hashtable holds 100 million unique keys. Let’s say those keys belong to 100,000 objects and each object has on average 1,000 keys. If we can arrange the key value store in some sort of hierarchy (which I am dreaming about for the next version of SHF) then it may be possible to group that set of 1,000 keys closer together within the entire data set. This potentially means 1,000 times fewer disk accesses than if the keys are scattered randomly across the 100 million keys. You can potentially have access to the 1,000 keys in memory after just a couple of disk accesses. Therefore, the access speed of the remaining 999 keys — after the first key is read from disk — might remain critical, and therefore cache line usage might remain critical. It’s a bit unfair talking about this if KingDB is intended to be a flat key value store. But in a world where OO is very fashionable — and to some unthinkable not to have — wouldn’t it be great if we could design a key value store whereby certain keys can be grouped together as if in an object? Here the key access time would profit from being ‘closer’ to other keys on disk.

        Regarding the performance figures for SharedHashFile: There are lots of tests shipped with SHF. So it’s easy to build and run the tests on your own box. I haven’t tried a test with larger entries but that sounds like an interesting thing to do 🙂 An interesting thing about the SharedHashFile algorithm (described in the header file) is that the “load factor” is not critical to performance. You can start off with 1 key, and add n keys incrementally and the read and write performance should always be the same as long as you have enough main memory 🙂

        • November 22, 2014

          The blog didn’t have a comment notification option. Thank you for pointing that out, I just installed a plugin for that and now all readers can subscribe to comments! 🙂

          The way that KingDB deals with collisions is that it groups all entries with the same hashed keys and stores them contiguously on disk. Because KingDB uses a log-structured approach, the initial writes will obviously not do that, and the grouping will occur during the compaction process. Once compaction has happened, all entries hashing to the same bucket are stored together, which allows the retrieval of the correct one to be done with a single sequential read operation. Of course, the compaction process incurs more writes than required as it needs to rewrite entries in order to group them, but that’s always a choice that the design of a database has to make: optimize for reads, or optimize for writes. In the case of KingDB, bursty writes are optimized by the log-structured storage, and in the long run, reads will be optimized due to the grouping in the compaction process.
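          The compaction step described above can be sketched like this (a toy stable sort over in-memory records, not KingDB’s actual compaction): after regrouping, all records of a bucket are adjacent, so one sequential read covers them.

```cpp
// Toy sketch of compaction-time grouping: log records are rewritten
// so that all entries hashing to the same bucket end up contiguous,
// letting a later read fetch them with one sequential I/O. A stable
// sort by bucket id is enough for the illustration.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Record {
    uint64_t bucket;  // hash of the key, reduced to a bucket id
    std::string key;
    std::string value;
};

// Rewrite the log with same-bucket records adjacent. Stable, so that
// within a bucket older records keep their relative order and the
// newest version of a key stays last.
inline std::vector<Record> compact(std::vector<Record> log) {
    std::stable_sort(log.begin(), log.end(),
                     [](const Record& a, const Record& b) { return a.bucket < b.bucket; });
    return log;
}
```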

          A hierarchical structure is a good idea to minimize random reads, similar to what B-trees are doing. But this comes at the cost of more complexity in the code and reduced performance with large batches of writes.

          In the end there is no best solution, it all depends on the workload that the database will have to deal with.

  2. Raphael S. Carvalho permalink
    December 4, 2014

    I’m trying to understand the following sentence: ‘As a direct consequence, any algorithm trying to be clever by applying updates in-place will bring no performance gain, and will needlessly increase code complexity.’ Isn’t it supposed to be avoiding updates in-place instead? If not, could you please explain it with other words?

    • December 4, 2014

      With HDDs, you can get chunks of contiguous logical memory mapped to contiguous physical memory on the disk. In that case, to speed up reads you would want to store your data in the same sequential order that it’s going to be read, and thus it makes sense to sometimes re-order data on the disk to guarantee this sequentiality. I could be wrong, but I believe that this is what is happening with primary keys in InnoDB (MySQL), i.e. data is being re-ordered by the InnoDB engine to be sorted by primary key, in order to speed up primary key lookups.

      A lot of the “sequential ordering” philosophy is still around, but it no longer applies to systems that use SSDs for their storage [1]. Same goes for reserving sequential space in anticipation of the future growth of a database, and then doing in-place updates in those locations. Therefore yes, with SSDs it’s better to avoid in-place updates, and one of the most common approaches to solve that nowadays is to use log-structured storage.

      Was this helpful? 🙂

      [1] http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/

  3. May 19, 2019

    Typo: “best thing known to ^mankind^ after hash tables.”
    I haven’t read the SSD article fully, but I’m curious: if you had to summarize writing a db for HDDs vs SSDs vs the new NVMe SSDs, what do you think the main things to take into account are?
