Coding for SSDs – Part 6: A Summary – What every programmer should know about solid-state drives

2014 February 12

This is Part 6 of 6 of “Coding for SSDs”. For the other parts and sections, you can refer to the Table of Contents. This is a series of articles that I wrote to share what I learned while researching SSDs, and how to make code perform well on SSDs.

In this part, I summarize the content of all the other parts in the form of concise self-contained paragraphs. Each paragraph references one or more sections of the other parts, which you can turn to for more detailed information on each topic.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling in the form at the top right corner of the blog.
As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

Basics

1. Memory cell types

A solid-state drive (SSD) is a flash-memory based data storage device. Bits are stored into cells, which exist in three types: 1 bit per cell (single level cell, SLC), 2 bits per cell (multi-level cell, MLC), and 3 bits per cell (triple-level cell, TLC).

See also: Section 1.1

2. Limited lifespan

Each cell has a maximum number of P/E cycles (Program/Erase), after which the cell is considered defective. This means that NAND-flash memory wears out and has a limited lifespan.

See also: Section 1.1

3. Benchmarking is hard

Testers are humans, therefore not all benchmarks are free of errors. Be careful when reading the benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarking using the specific workload of your system, along with the specific SSD model that you want to use. Finally, make sure you look at the performance metrics that matter most for the system at hand.

See also: Sections 2.2 and 2.3

Pages and blocks

4. NAND-flash pages and blocks

Cells are grouped into pages, pages are grouped into blocks, and blocks are grouped into planes. The smallest unit through which data can be read or written is a page. Pages cannot be erased individually; only whole blocks can be erased. The size of a NAND-flash page can vary, and most drives have pages of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, and each block contains 256 pages of 8 KB each.

See also: Section 3.2
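
As a quick sanity check, the 840 EVO geometry quoted above can be verified with trivial arithmetic, using only the values from this paragraph:

```python
# Example geometry from this section (Samsung SSD 840 EVO):
page_size_kb = 8           # one NAND-flash page, in KB
pages_per_block = 256      # pages in one block

block_size_kb = page_size_kb * pages_per_block
print(block_size_kb)       # 2048 KB, i.e. a 2 MB block
```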

5. Reads are aligned on page size

It is not possible to read less than one page at once. One can of course request just a single byte from the operating system, but a full page will be retrieved on the SSD, forcing more data than necessary to be read.

See also: Section 3.2

6. Writes are aligned on page size

When writing to an SSD, writes happen by increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification. Writing to a page is also called “to program” a page.

See also: Section 3.2
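
To build an intuition for write amplification, the ratio of physically written bytes to logically requested bytes can be sketched as below. This is a simplification that assumes a hypothetical 16 KB page and a controller that simply rounds every write up to whole pages; real firmware is more complex:

```python
def write_amplification(write_size, page_size=16 * 1024):
    """Ratio of bytes physically written to bytes logically requested,
    assuming each write is rounded up to whole NAND-flash pages."""
    pages = -(-write_size // page_size)  # ceiling division
    return (pages * page_size) / write_size

print(write_amplification(1))          # 16384.0: one byte costs a whole page
print(write_amplification(16 * 1024))  # 1.0: page-sized writes have no overhead
```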

7. Pages cannot be overwritten

A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in-place, as the “free” page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being “stale”, and will remain as such until it is erased.

See also: Section 3.2

8. Erases are aligned on block size

Pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages, and it is only possible to erase whole blocks at once.

See also: Section 3.2

SSD controller and internals

9. Flash Translation Layer

The Flash Translation Layer (FTL) is a component of the SSD controller which maps Logical Block Addresses (LBA) from the host to Physical Block Addresses (PBA) on the drive. Most recent drives implement an approach called “hybrid log-block mapping” or one of its derivatives, which works in a way that is similar to log-structured file systems. This allows random writes to be handled like sequential writes.

See also: Section 4.2

10. Internal parallelism

Internally, several levels of parallelism make it possible to write to several blocks at once in different NAND-flash chips; these blocks are grouped into what is called a “clustered block”.

See also: Section 6

11. Wear leveling

Because NAND-flash cells wear out, one of the main goals of the FTL is to distribute work among cells as evenly as possible, so that all blocks reach their P/E cycle limit and wear out at the same time.

See also: Section 3.4

12. Garbage collection

The garbage collection process in the SSD controller ensures that “stale” pages are erased and restored into a “free” state so that the incoming write commands can be processed.

See also: Section 4.4

13. Background operations can affect foreground operations

Background operations such as garbage collection can negatively impact foreground operations from the host, especially in the case of a sustained workload of small random writes.

See also: Section 4.4

Access patterns

14. Never write less than a page

Avoid writing chunks of data that are smaller than the size of a NAND-flash page, to minimize write amplification and prevent read-modify-write operations. The largest page size at the moment is 16 KB, so it is the value that should be used by default. This size depends on the SSD model, and you may need to increase it in the future as SSDs improve.

See also: Sections 3.2 and 3.3


15. Align writes

Align writes to the page size, and write chunks of data that are multiples of the page size.

See also: Sections 3.2 and 3.3
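
Rounding a byte range outward to page boundaries could look like the following sketch. `align_range` is a hypothetical helper, and the 16 KB page size is just the safe default from section 14:

```python
PAGE_SIZE = 16 * 1024  # assumed page size; the safe default from section 14

def align_range(offset, length, page_size=PAGE_SIZE):
    """Expand a byte range so it starts and ends on page boundaries."""
    start = (offset // page_size) * page_size              # round start down
    end = -(-(offset + length) // page_size) * page_size   # round end up
    return start, end - start

print(align_range(5000, 100))     # (0, 16384): covers the whole first page
print(align_range(16384, 20000))  # (16384, 32768): two full pages
```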

16. Buffer small writes

To maximize throughput, whenever possible keep small writes in a buffer in RAM, and when the buffer is full, perform a single large write to batch all the small writes.

See also: Sections 3.2 and 3.3
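
A minimal sketch of such a write buffer, assuming a file-like target; `BufferedAppender` is a hypothetical class name, not tied to any real API:

```python
import io

class BufferedAppender:
    """Accumulate small writes in RAM and flush them as one large write."""

    def __init__(self, fileobj, buffer_size=16 * 1024):
        self.fileobj = fileobj
        self.buffer_size = buffer_size
        self.buffer = bytearray()
        self.flush_count = 0  # how many physical writes were issued

    def write(self, data):
        self.buffer.extend(data)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.fileobj.write(bytes(self.buffer))
            self.buffer.clear()
            self.flush_count += 1

# Usage: 1000 writes of 100 bytes each become only a handful of large writes.
out = io.BytesIO()
w = BufferedAppender(out, buffer_size=16 * 1024)
for _ in range(1000):
    w.write(b"x" * 100)
w.flush()
print(w.flush_count)  # 7 flushes instead of 1000 small writes
```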

17. To improve the read performance, write related data together

Read performance is a consequence of the write pattern. When a large chunk of data is written at once, it is spread across separate NAND-flash chips. Thus you should write related data in the same page, block, or clustered block, so it can later be read faster with a single I/O request, by taking advantage of the internal parallelism.

See also: Section 7.3

18. Separate read and write requests

A workload made of a mix of small interleaved reads and writes will prevent the internal caching and readahead mechanisms from working properly, and will cause the throughput to drop. It is best to avoid simultaneous reads and writes, and to perform them one after the other in large chunks, preferably of the size of the clustered block. For example, if 1000 files have to be updated, you could iterate over the files, doing a read and a write on a file and then moving to the next file, but that would be slow. It would be better to read all 1000 files at once and then write back to those 1000 files at once.

See also: Section 7.4
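
The 1000-file example could be sketched like this, with a distinct read phase followed by a distinct write phase; the helper names are hypothetical:

```python
import os
import tempfile

def update_files_batched(paths, transform):
    """Read all files first, then write them all back, instead of
    interleaving one read and one write per file."""
    contents = []
    for p in paths:                       # read phase: reads only
        with open(p, "rb") as f:
            contents.append(f.read())
    for p, data in zip(paths, contents):  # write phase: writes only
        with open(p, "wb") as f:
            f.write(transform(data))

# Usage: uppercase the contents of a few temporary files.
paths = []
for i in range(3):
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello %d" % i)
    os.close(fd)
    paths.append(path)

update_files_batched(paths, bytes.upper)
results = [open(p, "rb").read() for p in paths]
for p in paths:
    os.remove(p)
print(results[0])  # b'HELLO 0'
```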

19. Invalidate obsolete data in batch

When some data is no longer needed or needs to be deleted, it is better to wait and invalidate it in a large batch in a single operation. This will allow the garbage collection process to handle larger areas at once and will help minimize internal fragmentation.

See also: Section 4.4
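
One way to batch invalidations is to defer deletions and flush them in groups. This is a hypothetical sketch: `delete_batch` stands in for whatever deletion call the system actually uses, and the batch size is arbitrary:

```python
class DeferredDeleter:
    """Collect delete requests and apply them in one batched operation,
    so larger contiguous areas can be reclaimed at once."""

    def __init__(self, delete_batch, batch_size=1024):
        self.delete_batch = delete_batch  # callable taking a list of keys
        self.batch_size = batch_size
        self.pending = []

    def delete(self, key):
        self.pending.append(key)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.delete_batch(self.pending)
            self.pending = []

# Usage: 2500 deletions become 3 batched operations.
batches = []
d = DeferredDeleter(batches.append, batch_size=1024)
for key in range(2500):
    d.delete(key)
d.flush()
print(len(batches))  # 3
```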

20. Random writes are not always slower than sequential writes

If the writes are small (i.e. below the size of the clustered block), then random writes are slower than sequential writes. If the writes are both a multiple of and aligned to the size of a clustered block, then random writes will use all the available levels of internal parallelism, and will perform just as well as sequential writes. For most drives, the clustered block has a size of 16 MB or 32 MB, therefore it is safe to use 32 MB.

See also: Section 7.2

21. A large single-threaded read is better than many small concurrent reads

Concurrent random reads cannot fully make use of the readahead mechanism. In addition, multiple Logical Block Addresses may end up on the same chip, not taking advantage of the internal parallelism. A large read operation will access sequential addresses, and will therefore be able to use the readahead buffer if present, and to use the internal parallelism. Consequently, if the use case allows it, it is better to issue a single large read request.

See also: Section 7.3

22. A large single-threaded write is better than many small concurrent writes

A large single-threaded write request offers the same throughput as many small concurrent writes; however, in terms of latency, a single large write has a better response time than concurrent writes. Therefore, whenever possible, it is best to perform large single-threaded writes.

See also: Section 7.2

23. When the writes are small and cannot be grouped or buffered, multi-threading is beneficial

Many concurrent small write requests will offer a better throughput than a single small write request. So if the I/O is small and cannot be batched, it is better to use multiple threads.

See also: Section 7.2
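
A sketch of this pattern, using a lock-protected in-memory buffer as a stand-in for the drive; the thread count and write size are arbitrary illustration values:

```python
import concurrent.futures
import io
import threading

# When small writes cannot be batched, issuing them from multiple threads
# keeps more of the drive's internal parallelism busy. Here an in-memory
# "drive" stands in for the real device.
drive = io.BytesIO()
lock = threading.Lock()

def small_write(payload):
    with lock:  # serialize access to the shared buffer
        drive.write(payload)

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(small_write, (b"x" * 512 for _ in range(100)))

print(len(drive.getvalue()))  # 51200 bytes written across 100 small requests
```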

24. Split cold and hot data

Hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

See also: Section 4.4

25. Buffer hot data

Extremely hot data and other high-change metadata should be buffered as much as possible and written to the drive as infrequently as possible.

See also: Section 4.4

System optimizations

26. PCI Express and SAS are faster than SATA

The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.

See also: Section 2.1

27. Over-provisioning is useful for wear leveling and performance

A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible to and used by the SSD controller. Over-provisioning helps the wear leveling mechanisms to cope with the inherent limited lifespan of NAND-flash cells. For workloads in which writes are not so heavy, 10% to 15% of over-provisioning is enough. For workloads of sustained random writes, keeping up to 25% of over-provisioning will improve performance. The over-provisioning will act as a buffer of NAND-flash blocks, helping the garbage collection process to absorb peaks of writes.

See also: Section 5.2
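
The rules of thumb above translate into a simple sizing sketch; the percentages are the ones quoted in this paragraph, and `overprovisioned_capacity` is a hypothetical helper name:

```python
def overprovisioned_capacity(physical_gb, workload="light"):
    """Suggested usable partition size, leaving spare NAND-flash blocks
    for wear leveling and garbage collection."""
    spare = {"light": 0.10, "moderate": 0.15, "heavy": 0.25}[workload]
    return physical_gb * (1 - spare)

print(overprovisioned_capacity(512))            # 460.8 GB usable
print(overprovisioned_capacity(512, "heavy"))   # 384.0 GB usable
```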

28. Enable the TRIM command

Make sure your kernel and filesystem support the TRIM command. The TRIM command notifies the SSD controller when a block is deleted. The garbage collection process can then erase blocks in the background during idle times, preparing the drive to face large write workloads.

See also: Section 5.1

29. Align the partition

To ensure that logical writes are truly aligned to the physical memory, you must align the partition to the NAND-flash page size of the drive.

See also: Section 8.1
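
A quick way to check alignment from a partition's starting sector, assuming the usual 512-byte logical sectors and the 16 KB default page size; real drives may differ on both counts:

```python
SECTOR_SIZE = 512  # bytes per logical sector, the usual value

def is_partition_aligned(start_sector, page_size=16 * 1024):
    """Check that a partition's starting byte offset is a multiple of
    the NAND-flash page size."""
    return (start_sector * SECTOR_SIZE) % page_size == 0

print(is_partition_aligned(2048))  # True: the common 1 MiB-aligned start
print(is_partition_aligned(63))    # False: the old MS-DOS default start
```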

Conclusion

This summary concludes the “Coding for SSDs” article series. I hope that I was able to convey in an understandable manner what I have learned during my personal research on solid-state drives.

If after reading this series of articles you want to go more in-depth on SSDs, a good first step would be to read some of the publications and articles linked in the reference sections of Parts 2 to 5.

Another great resource is the FAST conference (the USENIX Conference on File and Storage Technologies). A lot of excellent research is being presented there every year. I highly recommend their website, a good starting point being the videos and publications for FAST 2013.

Translations

This article was translated into Simplified Chinese by Xiong Duo.

26 Responses
  1. Dan permalink
    February 20, 2014

    “blocks are grouped into planes. The smallest unit through which a block can be read or written is a page.”

    Did you mean ‘page’ instead of ‘plane’?

    • Michael Aquilina permalink
      February 20, 2014

      Yeah, I was wondering the same thing here. He does not mention plane again, so I'm guessing page is what he meant.

      • February 20, 2014

        What I meant is that there are multiple pages in a block, and multiple blocks in a plane. A plane is just another architectural unit, which is a group of multiple blocks. Obviously the wording is confusing, I will edit this soon. Thanks for pointing that out!

        • Doug permalink
          February 21, 2015

          A single plane can be written or erased at one time. There can be 2-4 planes in a NAND part. Writing to one plane means it is tied up and not available for a read or another write operation.

  2. Ata Roboubi permalink
    February 20, 2014

    Thank you for detailing this stuff. I had a quick read.

    For the sake of other readers, if you don’t have time to read Part 6, here is the summary of the summary:

    If you are using a decent OS like linux, as opposed to direct hardware access (who other than google ever did that?), all you need to worry about is how to co-locate your data. By “using”, I don’t mean using O_DIRECT, which basically says, “OS, please get out of my way.”

    If you do use O_DIRECT, do “man 2 open” and search the man page for “monkey”. If that’s not enough, google “Linus O_DIRECT” and look at the war with Oracle and the like. You could probably also google “Linus Braindead” and you are likely to find a post about O_DIRECT.

    If you are one of the few people actually working on the kernel’s disk layer, you probably know all this, and it is unlikely you will ever be reading this.

    As for co-locating data, there is no way to do it without knowing your app inside out. So know your app. For some apps, it can make orders of magnitude difference. That’s your domain. Leave the disk to the kernel. Let it use its cache and its scheduler. They will be better than yours and will benefit the whole system, in case other apps need to get at the data too. You can try to help the kernel by doing vectorized reads/writes, although that’s a bigger deal with spinning disks.

    Ata Roboubi

  3. Rajiv permalink
    February 23, 2014

    Thank you for this great series. If you are taking requests, it would be awesome to get some practical examples of these principles being put to practice. For example: How data structures like BTrees can be tuned to work better on SSDs or how entirely new data structures should be used to properly utilize SSDs.

    • February 26, 2014

      I am planning on writing an article explaining how to implement a hash table for SSDs as part of my key-value store project. I’m quite busy at the moment though, so that may take some time :)

      Meanwhile, you can take a look at LevelDB and SILT, or any log-structured approach.

      LevelDB: https://code.google.com/p/leveldb/
      SILT: http://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf

      • Doug permalink
        February 21, 2015

        I work on SSD firmware, so maybe have more details than most people.
        Sending big block transfers is faster, no matter the technology used. There is less communication overhead.
        You don’t really know all that goes on inside the drive. Drives can have different types and sizes of NAND inside, even in one drive, much less system to system. Trying to match the block size probably won’t get you a huge gain long term. Usually the external block size does not match the internal NAND write size, the firmware hides the details.
        Optimizing an algorithm for SSD seems like a really hard way to solve a performance problem. You would be better off putting drives in a RAID, and buying more RAM.
        If you want to optimize one drive buy an over-provisioned drive that uses PCIe and NVMe with multiple lanes.

        • February 26, 2015

          Thanks for your insights! Putting drives in RAID certainly improves performance, but in many cases this solution cannot be used as the machines at hand only have a single drive.

          What do you think is a good approach for workloads that are a mix of reads and writes?

          What do you think of the efforts to optimize for SSDs on the side of the kernel and file system, like Samsung’s F2FS file system for example? They’ve shown improvements for random writes in the range of 2-3x.

  4. Nate permalink
    May 11, 2014

    Excellent stuff! I am in the NAND industry and this is one of the best overall write ups I have seen. You obviously put a lot of effort into your research.

  5. ivalice permalink
    May 14, 2014

    Great work! It summarizes most of the research about NAND-flash SSDs, and it helped me understand some basic ideas.

  6. Sandeep Gautham permalink
    May 22, 2014

    Excellent write up for starters.

  7. Nidhi permalink
    September 30, 2014

    Excellent material for beginners, but yes, to actually work in this area you need to learn more from publications. Thanks a lot for this stuff.

  8. Babu Alluri permalink
    November 9, 2014

    Excellent write up for beginners of SSDs. I could not stop myself until I completed the series, as I did for my study of programming in BASIC, in a day, 25 years ago. As someone suggested, the next step would be to add data structures, pseudo-algos… I just came up with my own words: keep the growing grass green.

  9. February 15, 2015

    Excellent information thank you

  10. February 15, 2015

    You make it seem that only flash-based products are on the market for SSDs; the fact is, there is a lot more than just flash-based products on the market. For example, the original SSDs were based on various technologies of RAM that is battery- or capacitor-backed. Usually these types of SSDs were based on static RAMs because they required a lot less current to keep the contents of the memory unchanged when the unit they were connected to was turned off.

    The smartest thing someone could do, if they’re going to do a lot of swapping in and out of physical memory and onto an SSD, would be to get a RAM-based device instead of a flash-based one, because they’re faster and allow for quicker context switches.

    These are best for swap files or other forms of cache. The thing that wears out flash memory in the long run is the constant writing to the memory, that and the power cycling.

    If you can create an algorithm to work with a RAM-based SSD, and when your final product is ready to be put on a flash-based SSD, then dump it to that drive; other than that, avoid it at all costs.

    Of course, the best idea is to get a computer motherboard that is capable of handling 64 GB RAM and larger, use a 64-bit OS and load down your system with a lot of RAM. Load from the SSD that is Flash based, keep everything you use in main memory. That will save your flash drive.

    • February 17, 2015

      While RAM is faster than flash-based memory, and RAM-based SSDs are faster than flash-based SSDs, RAM is a different technology with a completely different physical design. Building a system by storing data in the RAM implies making assumptions about memory access patterns that are fast on RAM, but that will inevitably turn out to be very harmful for flash-based memory. A very important difference is that RAM can be updated in-place, but not flash-based memory. Another important difference is that flash-based memory has P/E cycles, which means that blocks need to be garbage collected and erased every once in a while. Those two differences alone mean that if you switch the data storage for a system from RAM to flash, and considering that you’re interested in sheer throughput performance, then you will have to revise the way that you lay out the data in your storage, and very possibly part of your code/algorithm as well.

      I agree that for most systems which need high throughput, storing data in RAM is the best solution, however there are still many problems out there that are just too big to hold in the amount of RAM one can either physically have or afford nowadays. In those instances you will have to use a secondary storage, most likely a HDD or a SSD, and you’ll have to optimize for whatever physical design that storage solution has.

  11. Milan permalink
    February 16, 2015

    Have you thought about making a PDF, EPUB and MOBI output of this article series?

  12. Narges permalink
    August 21, 2015

    I want to know more about garbage collection. There is a threshold on the number of free pages; when it is reached, the garbage collection starts. How long does the garbage collection continue cleaning? What is the normal value for the GC threshold?

