Implementing a Key-Value Store – Part 6: Open-Addressing Hash Tables

2014 May 7

This is Part 6 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

In this article, I will compare several open-addressing hash tables: linear probing, hopscotch hashing, and Robin Hood hashing. I have already done some work on this topic, and in this article I want to gather data for more metrics in order to decide which hash table I will use for my key-value store.

The results section also contains an interesting observation about the maximum DIB (distance to initial bucket) for Robin Hood hashing, which originated from Kristofer Karlsson, a software engineer at Spotify and the author of the key-value store Sparkey.

This article will cover:

1. Open-addressing hash tables
2. Metrics
3. Experimental Protocol
4. Results and Discussion
5. Conclusion
6. References

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog.

As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!


Coding for SSDs – Part 6: A Summary – What every programmer should know about solid-state drives

2014 February 12

This is Part 6 of 6 of “Coding for SSDs”. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while teaching myself about SSDs, and about how to make code perform well on them.

In this part, I summarize the content of all the other parts in the form of concise, self-contained paragraphs. Each paragraph references one or more sections of the other parts, where more detailed information can be found on each topic.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog.

As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

Basics

1. Memory cell types

A solid-state drive (SSD) is a flash-memory based data storage device. Bits are stored in cells, which come in three types: 1 bit per cell (single-level cell, SLC), 2 bits per cell (multi-level cell, MLC), and 3 bits per cell (triple-level cell, TLC).

See also: Section 1.1

2. Limited lifespan

Each cell has a maximum number of P/E (Program/Erase) cycles, after which it is considered defective. This means that NAND-flash memory wears out and has a limited lifespan.

See also: Section 1.1

3. Benchmarking is hard

Testers are humans, and not all benchmarks are free of errors. Be careful when reading benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarks using the specific workload of your system, along with the specific SSD model that you want to use. Finally, make sure you look at the performance metrics that matter most for the system at hand.

See also: Sections 2.2 and 2.3

Pages and blocks

4. NAND-flash pages and blocks

Cells are grouped into a grid called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually; only whole blocks can be erased. The size of a NAND-flash page can vary, and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, each containing 256 pages of 8 KB.

See also: Section 3.2

5. Reads are aligned on page size

It is not possible to read less than one page at a time. One can of course request only a single byte from the operating system, but a full page will be retrieved on the SSD, forcing much more data than necessary to be read.

See also: Section 3.2

6. Writes are aligned on page size

When writing to an SSD, writes happen in increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification. Writing to a page is also referred to as “programming” a page.

See also: Section 3.2

7. Pages cannot be overwritten

A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in place, since the “free” page is a different page from the one that originally contained the data. Once the data is persisted to the drive, the original page is marked as “stale”, and remains so until it is erased.

See also: Section 3.2

8. Erases are aligned on block size

Pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages; only whole blocks can be erased at once.

See also: Section 3.2

SSD controller and internals

9. Flash Translation Layer

The Flash Translation Layer (FTL) is a component of the SSD controller which maps Logical Block Addresses (LBA) from the host to Physical Block Addresses (PBA) on the drive. Most recent drives implement an approach called “hybrid log-block mapping” or one of its derivatives, which works in a way that is similar to log-structured file systems. This allows random writes to be handled like sequential writes.

See also: Section 4.2
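
To make the log-block idea more concrete, here is a toy sketch in C++ (my own illustration under simplifying assumptions, not how any actual controller is implemented): every incoming write, random or sequential, lands on the next free physical page, and only the logical-to-physical mapping is updated.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model of log-structured mapping in an FTL. Real FTLs are far more
// complex: they map at both block and page granularity, and interact with
// garbage collection and wear leveling.
class ToyFtl {
 public:
  // Write to a logical page; returns the physical page that now holds it.
  uint64_t Write(uint64_t logical_page) {
    uint64_t physical_page = next_free_page_++;  // always sequential
    mapping_[logical_page] = physical_page;      // old location becomes stale
    return physical_page;
  }

 private:
  uint64_t next_free_page_ = 0;
  std::unordered_map<uint64_t, uint64_t> mapping_;
};
```

Because the destination is always the next free physical page, a burst of random logical writes turns into a sequential physical write pattern, which is exactly the behavior described above.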

10. Internal parallelism

Internally, several levels of parallelism allow writing to several blocks at once, across different NAND-flash chips, in what is called a “clustered block”.

See also: Section 6

11. Wear leveling

Because NAND-flash cells wear out, one of the main goals of the FTL is to distribute work among cells as evenly as possible, so that all blocks reach their P/E cycle limit and wear out at the same time.

See also: Section 3.4

12. Garbage collection

The garbage collection process in the SSD controller ensures that “stale” pages are erased and restored into a “free” state so that the incoming write commands can be processed.

See also: Section 4.4
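
As a rough illustration of that process, here is a toy sketch in C++ (my own simplification, assuming enough free pages are available in other blocks): the still-valid pages of a victim block are relocated, and the whole block is then erased back to the “free” state.

```cpp
#include <vector>

// States a NAND-flash page can be in, as described in the sections above.
enum class PageState { kFree, kValid, kStale };
using Block = std::vector<PageState>;

// Toy garbage collection step for a single victim block.
void CollectBlock(Block& victim, std::vector<PageState*>& free_pages) {
  // Relocate still-valid pages to free pages in other blocks.
  for (PageState& page : victim) {
    if (page == PageState::kValid && !free_pages.empty()) {
      *free_pages.back() = PageState::kValid;
      free_pages.pop_back();
    }
  }
  // Erase: the whole block goes back to the "free" state at once.
  victim.assign(victim.size(), PageState::kFree);
}
```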

13. Background operations can affect foreground operations

Background operations such as garbage collection can negatively impact foreground operations from the host, especially in the case of a sustained workload of small random writes.

See also: Section 4.4

Access patterns

14. Never write less than a page

Avoid writing chunks of data that are smaller than the size of a NAND-flash page, to minimize write amplification and prevent read-modify-write operations. The largest page size at the moment is 16 KB, so that is the value to use by default. This size depends on the SSD model, and you may need to increase it in the future as SSDs improve.

See also: Sections 3.2 and 3.3

15. Align writes

Align writes on the page size, and write chunks of data that are multiples of the page size.

See also: Sections 3.2 and 3.3
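
The alignment arithmetic itself is simple. In the sketch below, the 16 KB page size is an assumption that depends on the actual drive:

```cpp
#include <cstdint>
#include <cstdio>

// Assumed NAND-flash page size; the actual value depends on the SSD model.
static const uint64_t kPageSize = 16 * 1024;

// Round a size or offset up to the next multiple of the page size, so that
// writes start and end on page boundaries.
uint64_t RoundUpToPage(uint64_t n) {
  return ((n + kPageSize - 1) / kPageSize) * kPageSize;
}

int main() {
  // A 5000-byte chunk would be padded to a full 16384-byte page.
  printf("%llu\n", (unsigned long long)RoundUpToPage(5000));
  return 0;
}
```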

16. Buffer small writes

To maximize throughput, whenever possible keep small writes in a buffer in RAM, and when the buffer is full, perform a single large write to batch all the small writes.

See also: Sections 3.2 and 3.3
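
A minimal sketch of such a buffer in C++ could look like the following, where the 1 MB capacity is an arbitrary value chosen for illustration:

```cpp
#include <cstdio>
#include <vector>

// Accumulates small writes in RAM and flushes them to the file in a single
// large write once the buffer is full.
class WriteBuffer {
 public:
  explicit WriteBuffer(FILE* file, size_t capacity = 1 << 20)
      : file_(file) { buffer_.reserve(capacity); }

  void Write(const char* data, size_t size) {
    if (buffer_.size() + size > buffer_.capacity()) Flush();
    buffer_.insert(buffer_.end(), data, data + size);
  }

  void Flush() {
    if (!buffer_.empty()) {
      fwrite(buffer_.data(), 1, buffer_.size(), file_);
      buffer_.clear();
    }
  }

 private:
  FILE* file_;
  std::vector<char> buffer_;
};
```

Flush() must also be called before closing the file, so that the tail of the buffer is not lost.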

17. To improve read performance, write related data together

Read performance is a consequence of the write pattern. When a large chunk of data is written at once, it is spread across separate NAND-flash chips. Thus, related data should be written in the same page, block, or clustered block, so that it can later be read faster with a single I/O request, taking advantage of the internal parallelism.

See also: Section 7.3

18. Separate read and write requests

A workload made of a mix of small interleaved reads and writes will prevent the internal caching and readahead mechanisms from working properly, and will cause the throughput to drop. It is best to avoid simultaneous reads and writes, and to perform them one after the other in large chunks, preferably of the size of the clustered block. For example, if 1000 files have to be updated, you could iterate over the files, doing a read and a write on one file and then moving on to the next, but that would be slow. It is better to read all 1000 files at once, and then write back to all 1000 files at once.

See also: Section 7.4
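
The 1000-file example could be sketched as follows, with Transform() standing in for whatever update is applied to each file:

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Placeholder for the update applied to each file; identity in this sketch.
std::string Transform(const std::string& content) { return content; }

// Two-phase update: all the reads first, then all the writes, instead of
// interleaving a read and a write for each file.
void UpdateFiles(const std::vector<std::string>& paths) {
  std::vector<std::string> contents(paths.size());

  // Phase 1: reads only.
  for (std::size_t i = 0; i < paths.size(); i++) {
    std::ifstream in(paths[i], std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();
    contents[i] = ss.str();
  }

  // Phase 2: writes only.
  for (std::size_t i = 0; i < paths.size(); i++) {
    const std::string data = Transform(contents[i]);
    std::ofstream out(paths[i], std::ios::binary | std::ios::trunc);
    out.write(data.data(), data.size());
  }
}
```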

19. Invalidate obsolete data in batch

When some data is no longer needed or has to be deleted, it is better to wait and invalidate it in large batches in a single operation. This allows the garbage collection process to handle larger areas at once, and helps minimize internal fragmentation.

See also: Section 4.4

20. Random writes are not always slower than sequential writes

If the writes are small (i.e., below the size of the clustered block), then random writes are slower than sequential writes.
If writes are both multiples of and aligned to the size of a clustered block, then random writes will use all the available levels of internal parallelism, and will perform just as well as sequential writes. For most drives, the clustered block has a size of 16 MB or 32 MB, so it is safe to use 32 MB.

See also: Section 7.2

21. A large single-threaded read is better than many small concurrent reads

Concurrent random reads cannot fully make use of the readahead mechanism. In addition, multiple Logical Block Addresses may end up on the same chip, failing to take advantage of the internal parallelism. A large read operation accesses sequential addresses, and will therefore be able to use the readahead buffer if one is present, and to use the internal parallelism. Consequently, if the use case allows it, it is better to issue a single large read request.

See also: Section 7.3

22. A large single-threaded write is better than many small concurrent writes

A large single-threaded write request offers the same throughput as many small concurrent writes; however, in terms of latency, a single large write has a better response time than many concurrent writes. Therefore, whenever possible, it is best to perform large single-threaded writes.

See also: Section 7.2

23. When the writes are small and cannot be grouped or buffered, multi-threading is beneficial

Many concurrent small write requests offer better throughput than a single small write request. So if the I/O is small and cannot be batched, it is better to use multiple threads.

See also: Section 7.2

24. Split cold and hot data

Hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

See also: Section 4.4
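
At the application level, this could be sketched by routing updates to separate append-only files, so that hot entries never share pages with cold ones. The IsHot() predicate below is a made-up placeholder; a real system would track update frequency per key:

```cpp
#include <cstdio>
#include <string>

// Hypothetical heuristic: keys with a "session:" prefix are considered hot.
bool IsHot(const std::string& key) { return key.rfind("session:", 0) == 0; }

// Append a key-value entry to either the hot file or the cold file.
void Append(FILE* hot_file, FILE* cold_file,
            const std::string& key, const std::string& value) {
  FILE* target = IsHot(key) ? hot_file : cold_file;
  fprintf(target, "%s=%s\n", key.c_str(), value.c_str());
}
```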

25. Buffer hot data

Extremely hot data and other high-change metadata should be buffered as much as possible and written to the drive as infrequently as possible.

See also: Section 4.4

System optimizations

26. PCI Express and SAS are faster than SATA

The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.

See also: Section 2.1

27. Over-provisioning is useful for wear leveling and performance

A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible to and used by the SSD controller. Over-provisioning helps the wear-leveling mechanism cope with the inherent limited lifespan of NAND-flash cells. For workloads that are not write-heavy, 10% to 15% of over-provisioning is enough. For workloads of sustained random writes, keeping up to 25% of over-provisioning will improve performance. The over-provisioned space acts as a buffer of NAND-flash blocks, helping the garbage collection process absorb peaks of writes.

See also: Section 5.2

28. Enable the TRIM command

Make sure your kernel and filesystem support the TRIM command. The TRIM command notifies the SSD controller when a block is deleted. The garbage collection process can then erase blocks in the background during idle times, preparing the drive for large write workloads.

See also: Section 5.1

29. Align the partition

To ensure that logical writes are truly aligned with the physical memory, the partition must be aligned to the NAND-flash page size of the drive.

See also: Section 8.1
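
The check itself is simple arithmetic. In the sketch below, the 512-byte sector size and the 8 KB page size are assumptions that must be verified against the actual drive:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t kSectorSize = 512;       // assumed logical sector size
  const uint64_t kPageSize = 8 * 1024;    // assumed NAND-flash page size
  const uint64_t start_sector = 2048;     // e.g. as reported by fdisk

  const uint64_t start_bytes = start_sector * kSectorSize;
  if (start_bytes % kPageSize == 0)
    printf("partition start is page-aligned\n");
  else
    printf("partition start is NOT page-aligned\n");
  return 0;
}
```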

Conclusion

This summary concludes the “Coding for SSDs” article series. I hope that I was able to convey in an understandable manner what I learned during my personal research on solid-state drives.

If after reading this series of articles you want to go more in-depth about SSDs, a good first step would be to read some of the publications and articles linked in the reference sections of Parts 2 to 5.

Another great resource is the FAST conference (the USENIX Conference on File and Storage Technologies). A lot of excellent research is being presented there every year. I highly recommend their website, a good starting point being the videos and publications for FAST 2013.

Translations

This article was translated to Simplified Chinese by Xiong Duo.

Coding for SSDs – Part 5: Access Patterns and System Optimizations

2014 February 12

This is Part 5 of 6 of “Coding for SSDs”, covering Sections 7 and 8. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while teaching myself about SSDs, and about how to make code perform well on them. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.

Now that I have covered most of the inner workings of solid-state drives in the previous sections, I can provide data that will help build an understanding of which access patterns should be used and why they are indeed better than others. In this part, I explain how writes should be done, how reads should be done, and why concurrent read and write operations interfere with each other. I also cover a few optimizations at the level of the filesystem that can improve performance.


Coding for SSDs – Part 4: Advanced Functionalities and Internal Parallelism

2014 February 12

This is Part 4 of 6 of “Coding for SSDs”, covering Sections 5 and 6. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while teaching myself about SSDs, and about how to make code perform well on them. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.

In this part, I briefly cover some of the main SSD functionalities, such as TRIM and over-provisioning. I also present the different levels of internal parallelism in an SSD, and the concept of the clustered block.


Coding for SSDs – Part 3: Pages, Blocks, and the Flash Translation Layer

2014 February 12

This is Part 3 of 6 of “Coding for SSDs”, covering Sections 3 and 4. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while teaching myself about SSDs, and about how to make code perform well on them. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.

In this part, I explain how writes are handled at the page and block level, and I talk about the fundamental concepts of write amplification and wear leveling. I also describe what a Flash Translation Layer (FTL) is, and I cover its two main purposes: logical block mapping and garbage collection. In particular, I explain how write operations work in the context of hybrid log-block mapping.


Coding for SSDs – Part 2: Architecture of an SSD and Benchmarking

2014 February 12

This is Part 2 of 6 of “Coding for SSDs”, covering Sections 1 and 2. For other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while teaching myself about SSDs, and about how to make code perform well on them. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.

In this part, I explain the basics of NAND-flash memory, cell types, and basic SSD internal architecture. I also cover SSD benchmarking and how to interpret those benchmarks.


Coding for SSDs – Part 1: Introduction and Table of Contents

2014 February 12

Introduction

I want to make solid-state drives (SSDs) the optimal storage solution for my key-value store project. For that reason, I had to make sure I fully understood how SSDs work, so that I could optimize my hash table implementation to suit their internal characteristics. There is a lot of incomplete and contradictory information out there, and finding trustworthy information about SSDs was not an easy task. I had to do some substantial reading to find the proper publications and benchmarks, in order to convince myself that, if I had to be coding for SSDs, I would know what I was doing.

Then I figured that since I had done all the research, it would be useful to share the conclusions I had reached. My intent was to transform all the information already available into practical knowledge. I ended up writing a 30-page article, not very suitable for the format of a blog. I have therefore decided to split what I had written into logical parts that can be digested independently. The full Table of Contents is available at the bottom of this article.

The most remarkable contribution is Part 6, a summary of the whole “Coding for SSDs” article series, which I am sure programmers in a rush will appreciate. This summary covers the basics of SSDs along with all of the recommended access patterns for implementing reads and writes so as to get the best performance out of solid-state drives.

Another important detail is that “Coding for SSDs” is independent from my key-value store project (the IKVS series), and therefore no prior knowledge of the IKVS articles is needed. I am planning to write an article for the IKVS series on how a hash table can be implemented to take advantage of the internal characteristics of SSDs, though I have no precise date for that yet.

My only regret is not having produced any code of my own to prove that the access patterns I recommend are actually the best. However, even with such code, I would have needed to run benchmarks over a large array of different models of solid-state drives to confirm my results, which would have required more time and money than I can afford. I have cited my sources meticulously, and if you think that something in my recommendations is not correct, please leave a comment to shed light on it. And of course, feel free to drop a comment as well if you have questions or would like to contribute in any way.

Finally, remember to subscribe to the newsletter to receive a notification email every time a new article is posted on Code Capsule. The subscription panel is available at the top right corner of the blog.

UPDATE: This series of articles has been translated to Simplified Chinese by Xiong Duo.

Table of Contents

Part 1: Introduction and Table of Contents

Part 2: Architecture of an SSD and Benchmarking

    1. Structure of an SSD

        1.1 NAND-flash memory cells
        1.2 Organization of an SSD
        1.3 Manufacturing process

    2. Benchmarking and performance metrics

        2.1 Basic benchmarks
        2.2 Pre-conditioning
        2.3 Workloads and metrics

Part 3: Pages, Blocks, and the Flash Translation Layer

    3. Basic operations

        3.1 Read, write, erase
        3.2 Example of a write
        3.3 Write amplification
        3.4 Wear leveling

    4. Flash Translation Layer (FTL)

        4.1 On the necessity of having an FTL
        4.2 Logical block mapping
        4.3 Notes on the state of the industry
        4.4 Garbage collection

Part 4: Advanced Functionalities and Internal Parallelism

    5. Advanced functionalities

        5.1 TRIM
        5.2 Over-provisioning
        5.3 Secure Erase
        5.4 Native Command Queueing (NCQ)
        5.5 Power-loss protection

    6. Internal Parallelism in SSDs

        6.1 Limited I/O bus bandwidth
        6.2 Multiple levels of parallelism
        6.3 Clustered blocks

Part 5: Access Patterns and System Optimizations

    7. Access patterns

        7.1 Defining sequential and random I/O operations
        7.2 Writes
        7.3 Reads
        7.4 Concurrent reads and writes

    8. System optimizations

        8.1 Partition alignment
        8.2 Filesystem parameters
        8.3 Operating system I/O scheduler
        8.4 Swap
        8.5 Temporary files

Part 6: A Summary – What every programmer should know about solid-state drives

What’s next

Part 2 is available here. If you’re in a rush, you can also go directly to Part 6, which summarizes the content of all the other parts.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling out the form at the top right corner of the blog.

As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

Translations

This article was translated to Simplified Chinese by Xiong Duo.

Robin Hood hashing: backward shift deletion

2013 November 17

In my article about Robin Hood hashing [1], I had reached the conclusion that the Robin Hood hashing algorithm was performing poorly after deletions had occurred in the hash table, and I was quite disappointed with my results. During my research, I had stumbled upon an article by Paul Khuong, who had done some work on Robin Hood hashing [2], and I decided to email him to ask what he thought of my results. Paul got back to me, and after looking at my code on Github, he suggested that the performance could be greatly improved with only a small change to the removal method. He even submitted a patch to point out what the change would be, and he ran this modified version himself on the test cases included in the source code. The results he got looked really good, and therefore I had no other choice but to write a follow-up article to present the change to the code and show those results.

In this article, I present the backward shift deletion variant of the Robin Hood hashing algorithm. I start by explaining what backward shift deletion is and how to implement it, then I run it on the same test cases from my previous article. Finally, I discuss the new results.
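
To give an idea of the change, here is a minimal sketch of backward shift deletion in C++ (a simplified illustration, not the exact code from my repository): instead of leaving a tombstone, the entries following the deleted one are shifted back by one slot, until an empty bucket or an entry with a DIB of zero is reached.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Bucket {
  bool occupied = false;
  uint64_t hash = 0;
  std::string key, value;
};

// Deletes the entry at 'index' by shifting the following entries backward,
// stopping at an empty bucket or at an entry already in its initial bucket.
void BackwardShiftDelete(std::vector<Bucket>& buckets, size_t index) {
  const size_t num_buckets = buckets.size();
  size_t current = index;
  for (;;) {
    const size_t next = (current + 1) % num_buckets;
    if (!buckets[next].occupied ||
        buckets[next].hash % num_buckets == next) {  // DIB of zero
      buckets[current] = Bucket();  // last shifted slot becomes empty
      return;
    }
    buckets[current] = buckets[next];  // shift the entry one slot back
    current = next;
  }
}
```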


Robin Hood hashing

2013 November 11

As I am still doing research on open-addressing hash table algorithms, I tested an approach called Robin Hood hashing. Just like other reordering schemes, Robin Hood hashing moves entries that are already stored in the hash table as new items are inserted. In the original Robin Hood paper, it is claimed that Robin Hood hashing is very efficient because the probe sequence length has a small and almost constant variance.

In this article, I present how Robin Hood hashing handles insertion, deletion, and search, including a graphical representation of the insertion operation. I then use my own C++ implementation of Robin Hood hashing to see how the algorithm behaves under various load conditions, and I try to determine how efficient it really is. Finally, I discuss some of the drawbacks of Robin Hood hashing, and its applicability for the implementation of an on-disk key-value store.
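
For reference, the core of the insertion procedure can be sketched as follows (a simplified illustration without resizing or duplicate-key handling, not the exact code from my implementation): at each probe, the incoming entry is swapped with the resident entry whenever the incoming one is further away from its initial bucket, i.e. the “rich” entry gives way to the “poor” one.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Entry {
  bool occupied = false;
  uint64_t hash = 0;
  std::string key, value;
};

// Distance to initial bucket (DIB), accounting for wrap-around.
size_t Dib(const std::vector<Entry>& t, uint64_t hash, size_t pos) {
  return (pos + t.size() - hash % t.size()) % t.size();
}

// Inserts an entry; the table is assumed to have at least one free bucket.
void Insert(std::vector<Entry>& t, uint64_t hash,
            std::string key, std::string value) {
  Entry incoming{true, hash, std::move(key), std::move(value)};
  size_t pos = hash % t.size();
  size_t dib = 0;
  for (;;) {
    if (!t[pos].occupied) {
      t[pos] = std::move(incoming);  // empty bucket: done
      return;
    }
    if (Dib(t, t[pos].hash, pos) < dib) {
      std::swap(t[pos], incoming);   // steal from the rich
      dib = Dib(t, incoming.hash, pos);
    }
    pos = (pos + 1) % t.size();
    dib++;
  }
}
```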


Hopscotch hashing

2013 August 11
by Emmanuel Goossaert

I am currently experimenting with various hash table algorithms, and I stumbled upon an approach called hopscotch hashing. Hopscotch hashing is a reordering scheme that can be used with the open addressing method for collision resolution in hash tables. When using open addressing with only a probing sequence and no reordering, entries are inserted into the first empty bucket found in the sequence. With a reordering scheme, entries already in the table can be moved as new entries are inserted. Hopscotch hashing is interesting because it guarantees a small number of look-ups to find entries. In addition, those look-ups are in contiguous memory areas, which is very cache-friendly.

This article makes minor contributions to the original publication. First, a clear explanation of the insertion process is given, independent of the representation of neighborhoods. The original paper used the bitmap representation to present the algorithm, and I believe that things are simpler without it. Second, a new neighborhood representation is introduced, called the “shadow” representation. It derives the relationship between neighborhoods and initial buckets from the hashed keys, and does not require any extra memory. Finally, a C++ implementation is provided, and experimentation with this implementation sheds light on some limitations of hopscotch hashing that were not expressed in the original paper.
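
To illustrate the look-up guarantee mentioned above, here is a minimal sketch of a hopscotch search in C++ (a simplified illustration using a plain scan of the neighborhood rather than the bitmap or shadow representations): since an entry is always stored within a fixed-size neighborhood of its initial bucket, a search only ever probes H contiguous buckets.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const size_t H = 32;  // neighborhood size; 32 is the value used in the paper

struct Bucket {
  bool occupied = false;
  uint64_t hash = 0;
  std::string key, value;
};

// Returns true and fills *value_out if the key is found in the neighborhood
// of its initial bucket; by construction, it cannot be anywhere else.
bool Get(const std::vector<Bucket>& buckets, uint64_t hash,
         const std::string& key, std::string* value_out) {
  const size_t base = hash % buckets.size();
  for (size_t i = 0; i < H; i++) {
    const Bucket& b = buckets[(base + i) % buckets.size()];
    if (b.occupied && b.hash == hash && b.key == key) {
      *value_out = b.value;
      return true;
    }
  }
  return false;
}
```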
