How to get started with infrastructure and distributed systems

2016 January 3
by Emmanuel Goossaert

Most of us developers have had experience with web or native applications that run on a single computer, but things are a lot different when you need to build a distributed system to synchronize dozens, sometimes hundreds of computers to work together.

I recently received an email from someone asking me how to get started with infrastructure design, and I thought that I would share what I wrote him in a blog post if that can help more people who want to get started in that as well.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.

A basic example: a distributed web crawler

For multiple computers to work together, you need some sort of synchronization mechanisms. The most basic ones are databases and queues. Part of your computers are producers or masters, and another part are consumers or workers. The producers write data in a database, or enqueue jobs in a queue, and the consumers read the database or queue. The database or queue system runs on a single computer, with some locking, which guarantees that the workers don’t pick the same work or data to process.

Let’s take an example. Imagine you want to implement a web crawler that downloads web pages along with their images. One possible design for such a system will require the following components:

  • Queue: the queue contains the URLs to be crawled. Processes can add URLs to the queue, and workers can pick up URLs to download from the queue.
  • Crawlers: the crawlers pick URLs from the queue, either web pages or images, and download them. If a URL is a webpage, the crawlers also look for links in the page, and push all those links to the queue for other crawlers to pick them up. The crawlers are at the same time the producers and the consumers.
  • File storage: The file storage stores the web pages and images in an efficient manner.
  • Metadata: a database, either MySQL-like, Redis-like, or any other key-value store, will keep track of which URL has been downloaded already, and if so where it is stored locally.

The queue and the crawlers are their own sub-systems, they communicate with external web servers on the internet, with the metadata database, and with the file storage system. The file storage and metadata database are also their own sub-systems.

Figure 1 below shows how we can put all the sub-systems together to have a basic distributed web crawler. Here is how it works:

1. A crawler gets a URL from the queue.
2. The crawler checks in the database if the URL was already downloaded. If so, just drop it.
3. The crawler enqueues the URLs of all links and images in the page.
4. If the URL was not downloaded recently, get the latest version from the web server.
5. The crawler saves the file to the File Storage system: it talks to a reserse proxy that’s taking incoming requests and dispatching them to storage nodes.
6. The File Storage distributes load and replicates data across multiple servers.
7. The File Storage update the metadata database so we know which local file is storing which URL.


Figure 1: Architecture of a basic distributed web crawler

The advantage of a design like the one above is that you can scale up independently each sub-system. For example, if you need to crawl stuff faster, just add more crawlers. Maybe at some point you’ll have too many crawlers and you’ll need to split the queue into multiple queues. Or maybe you realize that you have to store more images than anticipated, so just add a few more storage nodes to your file storage system. If the metadata is becoming too much of a centralized point of contention, turn it into a distributed storage, use something like Cassandra or Riak for that. You get the idea.

And what I have presented above is just one way to build a simple crawler. There is no right or wrong way, only what works and what doesn’t work, considering the business requirements.

Talk to people who are doing it

The one unique way to truly learn how to build a distributed system is to maintain or build one, or work with someone who has built something big before. But obviously, if the company you’re currently working at does not have the scale or need for such a thing, then my advice is pretty useless…

Go to and find groups in your geographic area that talk about using NoSQL data storage systems, Big Data systems, etc. In those groups, identify the people who are working on large-scale systems and ask them questions about the problems they have and how they solve them. This is by far the most valuable thing you can do.

Basic concepts

There are a few basic concepts and tools that you need to know about, some sort of alphabet of distributed systems that you can later on pick from and combine to build systems:

  • Concepts of distributed systems: read a bit about the basic concepts in the field of Distributed Systems, such as consensus algorithms, consistent hashing, consistency, availability and partition tolerance.
  • RDBMs: relational database management systems, such as MySQL or PostgreSQL. RDMBs are one of the most significant invention of humankind from the last few decades. They’re like Excel spreadsheets on steroid. If you’re reading this article I’m assuming you’re a programmer and you’ve already worked with relational databases. If not, go read about MySQL or PostgreSQL right away! A good resource for that is the web site
  • Queues: queues are the simplest way to distribute work among a cluster of computers. There are some specific projects tackling the problem, such as RabbitMQ or ActiveMQ, and sometimes people just use a table in a good old database to implement a queue. Whatever works!
  • Load balancers: if queues are the basic mechanism for a cluster of computer to pull work from a central location, load balancers are the basic tool to push work to a cluster of computer. Take a look at Nginx and HAProxy.
  • Caches: sometimes accessing data from disk or a database is too slow, and you want to cache things in the RAM. Look at projects such as Memcached and Redis.
  • Hadoop/HDFS: Hadoop is a very spread distributed computing and distributed storage system. Knowing the basics of it is important. It is based on the MapReduce system developed at Google, and is documented in the MapReduce paper.
  • Distributed key-value stores: storing data on a single computer is easy. But what happens when a single computer is no longer enough to store all the data? You have to split your storage into two computers or more, and therefore you need mechanisms to distribute the load, replicate data, etc. Some interesting projects doing that you can look at are Cassandra and Riak.
  • Read papers and watch videos

    There is a ton of content online about large architectures and distributed systems. Read as much as you can. Sometimes the content can be very academic and full of math: if you don’t understand something, no big deal, put it aside, read about something else, and come back to it 2-3 weeks later and read again. Repeat until you understand, and as long as you keep coming at it without forcing it, you will understand eventually. Some references:

    Introductory resources

    Real-world systems and practical resources

    Theoretical content

    Build something on your own

    There are plenty of academic courses available online, but nothing replaces actually building something. It is always more interesting to apply the theory to solving real problems, because even though it’s good to know the theory on how to make perfect systems, except for life-critical applications it’s almost never necessary to build perfect systems.

    Also, you’ll learn more if you stay away from generic systems and instead focus on domain-specific systems. The more you know about the domain of the problem to solve, the more you are able to bend requirements to produce systems that are maybe not perfect, but that are simpler, and which deliver correct results within an acceptable confidence interval. For example for storage systems, most business requirements don’t need to have perfect synchronization of data across replica servers, and in most cases, business requirements are loose enough that you can get away with 1-2% or erroneous data, and sometimes even more. Academic classes online will only teach you about how to build systems that are perfect, but that are impractical to work with.

    It’s easy to bring up a dozen of servers on DigitalOcean or Amazon Web Services. At the time I’m writing this article, the smallest instance on DigitalOcean is $0.17 per day. Yes, 17 cents per day for a server. So you can bring up a cluster of 15 servers for a weekend to play with, and that will cost you only $5.

    Build whatever random thing you want to learn from, use queuing systems, NoSQL systems, caching systems, etc. Make it process lots of data, and learn from your mistakes. For example, things that come to my mind:

    • Build a system that crawls photos from a bunch of websites like the one I described above, and then have another system to create thumbnails for those images. Think about the implications of adding new thumbnail sizes and having to reprocess all images for that, having to re-crawl or having to keep the data up-to-date, having to serve the thumbnails to customers, etc.
    • Build a system that gathers metrics from various servers on the network. Metrics such as CPU activity, RAM usage, disk utilization, or any other random business-related metrics. Try using TCP and UDP, try using load balancers, etc.
    • Build a system that shards and replicate data across multiple computers. For example, you’re complete dataset is A, B, and C and it’s split across three servers: A1, B1, and C1. Then, to deal with server failure you want to replicate the data, and have exact copies of those servers in A2, B2, C2 and A3, B3, C3. Think about the failure scenarios, how you would replicate data, how you would keep the copies synced, etc.?

    Look at systems and web applications around you, and try to come up with simplified versions of them:

    • How would you store the map tiles for Google Maps?
    • How would you store the emails for Gmail?
    • How would you process images for Instagram?
    • How would you store the shopping cart for Amazon?
    • How would you connect drivers and users for Uber?

    Once you’ve build such systems, you have to think about what solutions you need to deploy new versions of your systems to production, how to gather metrics about the inner-workings and health of your systems, what type of monitoring and alerting you need, how you can run capacity tests so you can plan enough servers to survive request peaks and DDoS, etc. But those are totally different stories!

    I hope that this article helped explain how you can get started with infrastructure design and distributed systems. If you have any other resources you want to share, or if you have questions, just drop a comment below!

    To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.

You don’t need to read faster, just pick the right books

2015 December 13
by Emmanuel Goossaert


  • The non-fiction book publishers are imposing a standard that is toxic for the readers: books that are 200-300 pages long, sell for $10-20, and get 4-5 star ratings on Amazon.
  • Trying to read everything is overwhelming and unnecessary, as most non-fiction books do not deserve your time anyway. They just recycle old ideas that are available in shorter and better formats elsewhere.
  • Reading speed is a distraction from the real problem: you don’t need to read faster, just pick the right books.
  • Apply the Pareto principle, and just pick the three great books that will bring you 99% of the value you need, and focus on reading and re-reading only those.
  • Skip any paragraph or chapter that is not relevant to you. And if a book is just plain bad, drop it right away.
  • Trying to remember everything is a waste of time. Summarize what you read down to the core concepts and models. Make your brain a search engine, not a storage system.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.

Learning is awesome

Like most people, I love learning about new ideas, and a great way to that is to read non-fiction books. Books about psychology, business, history, technical topics, etc. But there are so many books out there, and so little time, it’s almost overwhelming!

So like many before me, I have researched techniques on how to read faster, but all of them ended up being no so workable for me. Reading speed doesn’t really matter for non-fiction books, as it is worthless to read a book as fast as possible if nothing is remembered. Thus another criteria to consider is retention, and how to assimilate the concepts you read about, so you can relate to them in your future thinking.

So what should we prioritize on? Read many books as fast as possible, or try to remember as much as possible from books? I’ve read a whole lot of non-fiction books over the past few years, and I’ve come to a conclusion: most of them are not even worth reading, and here why.

The current state of the book industry

The publishing industry has had a lot of time to experiment and refine their offers to find the sweet spot for their market. Let’s take an example to make it this more real. There has been a bunch of pop psychology books on the topic of expertise, here are the five most notable ones:

  • The Talent Code by Daniel Coyle, 256 pages, 4.5 stars, $16
  • Talent is Overrated by Geoff Colvin, 240 pages, 4.2 stars, $18
  • Mastery by Robert Greene, 352 pages, 4.5 stars, $12
  • Bounce by Matthew Syed, 336 pages, 4.4 stars, $11
  • Outliers by Malcolm Gladwell, 336 pages, 4.4 stars, $20

Do you notice something in that list? All those books fall in the same bucket, which is the industry standard for the average non-fiction books:

  • priced 10-20 dollars/euros
  • around 250-300 pages long
  • rated 4-5 stars on Amazon

Cal Newport wrote about how the non-fiction book publishing industry works. Non-fiction writers rarely write books and then try to sell them. In the majority of the cases they have agents who talk to publishers, and so writers pitch ideas for books to their agents who green-light them based on what they know about the current trends and what publishers want. The publishers then impose the industry standard, the infamous 250-300 pages that will sell for $10-20.

The standard of the book industry is toxic

So among those five books on expertise, which one should you read? That’s easy, the answer is none of them! They’re all closely or remotely based on the same 2007 HBR article The Making of an Expert by Anders Ericsson, itself based on earlier research by Ericsson and his peers. The HBR article is about 10 pages long, and will cost you nothing as you can read it online on the HBR website.

I’ve taken this one example of non-fiction books on the topic of expertise, and I’m sure you can easily think of other topics you know about where similar books are competing and bringing nothing new to the table. Most books are repeating what other books before them have covered, and will be forgotten 10 years from now.

And that’s why most non-fiction books feel the same, and are very repetitive. They’re basically trying to expand a 10-page article into a 250-page book by bending anecdotes with hindsight bias, all of that just to fit into an industry standard. And you as a reader and consumer end up wasting your time reading a 250-page book that really should just be 10 pages of concise and surgically-edited high quality content.

Pick a handful of books, read them, then re-read them

You don’t need to read everything, just read the good books. Following the Pareto principle, there should be 1% of the books out there that should provide you with 99% of all the value you need at this very moment of your life. Let’s pick a number, let’s say three books. There should be no more than three non-fiction books really worth reading for you right now, which are well-written, concise, and relevant to who you are and the areas where you need to grow at this very moment of your life.

So here is my advice to all of you trying to find ways to read faster or remember everything from what you read: just don’t. There is no need for that. Just try to find, among the gigantic pile of redundant books jamming the search results of Amazon, the one book that matters in a field and which will have 99% of impact on your mind, and just focus on reading that one. Assuming that this book was the best in its field, from there any additional book you read in the same field will have an exponentially decreasing impact on your mind.

How I read books at the moment

I’ve looked into various memory techniques such as the method of loci and spaced repetition, and concluded that for book it was not the good solution, because the underlying assumption and goal are wrong. Here my reasoning: if you need the knowledge every day, then the actual practice of it will make you remember it. And if you do not need the knowledge every day, then why do you even bother trying to remember it!

Here is what works for me for non-fiction books:

  • I do active reading by taking notes and writing down action points as I read.
  • In my notes, I summarize the concepts and models from each chapter.
  • If a paragraph or chapter does not seem relevant, I skim over it. If it’s not relevant at all, I just skip it.
  • Every 6-8 week period after the first reading, if I feel like I need a refresher then I re-read the book and my notes. Those re-reading always take a lot less time that the initial ones.
  • What really matters is not the actual content and anecdotes in the book, but the core concepts and models.
  • Next time I face a problem that relates to those concepts, I will know which book they came from. So essentially, I want to make my brain a search engine, not a storage system.

Of course it’s partly boring, because I’m not reading new books about some shiny new topic everybody is talking about, but who cares? I’m re-reading the few books I know are the top 1% in terms of impact on my brain, and spending the time re-reading them is the only way to ensure this great content will have a lasting impact on the way I think.

What’s next?

What’s your opinion on non-fiction books? Any reading trick or method you think I should have mentioned here? Drop a comment!

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.

Implementing a Key-Value Store – Part 9: Data Format and Memory Management in KingDB

2015 August 3

This is Part 9 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts. In this series of articles, I describe the research and process through which I am implementing a key-value database, which I have named “KingDB”. The source code is available at Please note that you do not need to read the previous parts to be able to follow. The previous parts were mostly exploratory, and starting with Part 8 is perfectly fine.

In this article, I explain how the storage engine of KingDB works, including details about the data format. I also cover how memory management is done through the use of a compaction process.

read more…

Implementing a Key-Value Store – Part 8: Architecture of KingDB

2015 May 25

This is Part 8 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts. In this series of articles, I describe the research and process through which I am implementing a key-value database, which I have named “KingDB”. The source code is available at Please note that you do not need to read the previous parts to be able to follow. The previous parts were mostly exploratory, and starting with Part 8 is perfectly fine.

In the previous articles, I have laid out the research and discussion around what needs to be considered when implementing a new key-value store. In this article, I will present the architecture of KingDB, the key-value store of this article series that I have finally finished implementing.

read more…

Introducing KingDB v0.9.0

2015 May 14
by Emmanuel Goossaert

I am pleased to announce that I am releasing the very first version of KingDB, the fast persisted key-value store. KingDB is a side-project that I have been hacking on intermittently over the last couple of years. It has taken a lot of my personal time, therefore I am very happy to finally have reached that moment.

Go to to find the source code, documentation and benchmarks.

KingDB is interesting for many reasons:

  • Fast for heavy write workloads and random reads.
  • The architecture, code, and data format are simple.
  • Multipart API to read and write large entries in smaller parts.
  • Multiple threads can access the same database safely.
  • Crash-proof: nothing ever gets overwritten.
  • Iterators and read-only consistent snapshots.
  • Compaction happens in a background thread, and does not block reads or writes.
  • The data format allows hot backups to be made.
  • Covered by unit tests.

Version 0.9.0 is still alpha code, therefore even if KingDB has many unit tests that ensure the stability of its core components, make sure you run tests in your own environment before using KingDB in production. New features and optimizations will come along the way.

Over the coming weeks I will publish the last articles for the IKVS series, which will cover the architecture and data format of KingDB.

What now? Go to to check out KingDB! If you have any questions, drop a comment below or join the KingDB mailing list, I would be happy to answer them :)

An afternoon in 1983 with Hector

2014 November 22



Back in 1998, I was 13 years old and had no money to buy a computer. But that wouldn’t stop me. I didn’t care if I had a brand new computer or a used one, all I wanted was a computer. From the conversations of adults around me, I heard it was common for businesses to renew their hardware and throw their old computers into the trash. So I figured, all I had to do was to be in the right trash at the right time, or even better, make the trash come to me. In these years, there were tons of computer paper magazines, so I decided to send a letter to one of them, of which I have now forgotten the name, to publish an ad in their classified section: “13 year-old student in the Paris area, will come to your home or office to get any computers you are about to throw in the bin”. From this classified, I got a 486DX-80 PC, a printer, and a Hector HR2+ computer.

The 486DX-80 and the printer have served me well, and are long gone. Last summer, while visiting my family in France I decided to take a look into the attic. As I was making my way through rusty nails and spider webs, I noticed a bag covered with dust in a dark corner of the room. I had the feeling that I was about to make a very good discovery, and I was not disappointed. I opened the bag with excitement, and there it was, the almighty Hector 2HR+ computer!

In the bag came all the booklets, cables, and cassettes, so I decided that I would spend the afternoon writing code on that machine, and that I would run this code before nightfall. But before I go any further on that, a bit of history about the Hector 2HR+…

read more…

Implementing a Key-Value Store – Part 6: Open-Addressing Hash Tables

2014 May 7

This is Part 6 of the IKVS series, “Implementing a Key-Value Store”. You can also check the Table of Contents for other parts.

In this article, I will compare several open-addressing hash tables: Linear Probing, Hopscotch hashing, and Robin Hood hashing. I have already done some work on this topic, and in this article I want to gather data for more metrics in order to decide which hash table I will use for my key-value store.

The result section also contains an interesting observation about the maximum DIB for Robin Hood hashing, which originated from Kristofer Karlsson, a software engineer at Spotify and the author of the key-value store Sparkey.

This article will cover:

1. Open-addressing hash tables
2. Metrics
3. Experimental Protocol
4. Results and Discussion
5. Conclusion
6. References

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.
As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!

read more…

Coding for SSDs – Part 6: A Summary – What every programmer should know about solid-state drives

2014 February 12

This is Part 6 over 6 of “Coding for SSDs”. For other parts and sections, you can refer to the Table to Contents. This is a series of articles that I wrote to share what I learned while documenting myself on SSDs, and on how to make code perform well on SSDs.

In this part, I am summarizing the content from all the other parts in the form of concise self-contained paragraphs. Each paragraph is referencing one or more sections of the other parts, which allow to get more detailed information regarding each topic.

To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.
As usual, comments are open at the bottom of this post, and I am always happy to welcome questions, corrections and contributions!


1. Memory cell types

A solid-state drives (SSD) is a flash-memory based data storage device. Bits are stored into cells, which exist in three types: 1 bit per cell (single level cell, SLC), 2 bits per cell (multiple level cell, MLC), 3 bits per cell (triple-level cell, TLC).

See also: Section 1.1

2. Limited lifespan

Each cell has a maximum number of P/E cycles (Program/Erase), after which the cell is considered defective. This means that NAND-flash memory wears off and has a limited lifespan.

See also: Section 1.1

3. Benchmarking is hard

Testers are humans, therefore not all benchmarks are exempt of errors. Be careful when reading the benchmarks from manufacturers or third parties, and use multiple sources before trusting any numbers. Whenever possible, run your own in-house benchmarking using the specific workload of your system, along with the specific SSD model that you want to use. Finally, make sure you look at the performance metrics that matter most for the system at hand.

See also: Sections 2.2 and 2.3

Pages and blocks

4. NAND-flash pages and blocks

Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually, only whole blocks can be erased. The size of a NAND-flash page size can vary, and most drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, and each block contains 256 pages of 8 KB each.

See also: Section 3.2

5. Reads are aligned on page size

It is not possible to read less than one page at once. One can of course only request just one byte from the operating system, but a full page will be retrieved in the SSD, forcing a lot more data to be read than necessary.

See also: Section 3.2

6. Writes are aligned on page size

When writing to an SSD, writes happen by increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification. Writing to a page is also called “to program” a page.

See also: Section 3.2

7. Pages cannot be overwritten

A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in-place, as the “free” page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being “stale”, and will remain as such until it is erased.

See also: Section 3.2

8. Erases are aligned on block size

Pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages, and it is only possible to erase whole blocks at once.

See also: Section 3.2

SSD controller and internals

9. Flash Translation Layer

The Flash Translation Layer (FTL) is a component of the SSD controller which maps Logical Block Addresses (LBA) from the host to Physical Block Addresses (PBA) on the drive. Most recent drives implement an approach called “hybrid log-block mapping” or one of its derivatives, which works in a way that is similar to log-structured file systems. This allows random writes to be handled like sequential writes.

See also: Section 4.2

10. Internal parallelism

Internally, several levels of parallelism allow to write to several blocks at once into different NAND-flash chips, to what is called a “clustered block”.

See also: Section 6

11. Wear leveling

Because NAND-flash cells are wearing off, one of the main goals of the FTL is to distribute the work among cells as evenly as possible so that blocks will reach their P/E cycle limit and wear off at the same time.

See also: Section 3.4

12. Garbage collection

The garbage collection process in the SSD controller ensures that “stale” pages are erased and restored into a “free” state so that the incoming write commands can be processed.

See also: Section 4.4

13. Background operations can affect foreground operations

Background operations such as garbage collection can impact negatively on foreground operations from the host, especially in the case of a sustained workload of small random writes.

See also: Section 4.4

Access patterns

14. Never write less than a page

Avoid writing chunks of data that are below the size of a NAND-flash page to minimize write amplification and prevent read-modify-write operations. The largest size for a page at the moment is 16 KB, therefore it is the value that should be used by default. This size depends on the SSD models and you may need to increase it in the future as SSDs improve.

See also: Sections 3.2 and 3.3

15. Align writes

Align writes on the page size, and write chunks of data that are multiple of the page size.

See also: Sections 3.2 and 3.3

16. Buffer small writes

To maximize throughput, whenever possible keep small writes into a buffer in RAM and when the buffer is full, perform a single large write to batch all the small writes.

See also: Sections 3.2 and 3.3

17. To improve the read performance, write related data together

Read performance is a consequence of the write pattern. When a large chunk of data is written at once, it is spread across separate NAND-flash chips. Thus you should write related data in the same page, block, or clustered block, so it can later be read faster with a single I/O request, by taking advantage of the internal parallelism.

See also: Section 7.3

18. Separate read and write requests

A workload made of a mix of small interleaved reads and writes will prevent the internal caching and readahead mechanism to work properly, and will cause the throughput to drop. It is best to avoid simultaneous reads and writes, and perform them one after the other in large chunks, preferably of the size of the clustered block. For example, if 1000 files have to be updated, you could iterate over the files, doing a read and write on a file and then moving to the next file, but that would be slow. It would be better to reads all 1000 files at once and then write back to those 1000 files at once.

See also: Section 7.4

19. Invalidate obsolete data in batch

When some data is no longer needed or need to be deleted, it is better to wait and invalidate it in a large batches in a single operation. This will allow the garbage collector process to handle larger areas at once and will help minimizing internal fragmentation.

See also: Section 4.4

20. Random writes are not always slower than sequential writes

If the writes are small (i.e. below the size of the clustered block), then random writes are slower than sequential writes.
If writes are both multiple of and aligned to the size of a clustered block, the random writes will use all the available levels of internal parallelism, and will perform just as well as sequential writes. For most drives, the clustered block has a size of 16 MB or 32 MB, therefore it is safe to use 32 MB.

See also: Section 7.2

21. A large single-threaded read is better than many small concurrent reads

Concurrent random reads cannot fully make use of the readahead mechanism. In addition, multiple Logical Block Addresses may end up on the same chip, not taking advantage or of the internal parallelism. A large read operation will access sequential addresses and will therefore be able to use the readahead buffer if present, and use the internal parallelism. Consequently if the use case allows it, it is better to issue a large read request.

See also: Section 7.3

22. A large single-threaded write is better than many small concurrent writes

A large single-threaded write request offers the same throughput as many small concurrent writes, however in terms of latency, a large single write has a better response time than concurrent writes. Therefore, whenever possible, it is best to perform single-threaded large writes.

See also: Section 7.2

23. When the writes are small and cannot be grouped or buffered, multi-threading is beneficial

Many concurrent small write requests will offer a better throughput than a single small write request. So if the I/O is small and cannot be batched, it is better to use multiple threads.

See also: Section 7.2

24. Split cold and hot data

Hot data is data that changes frequently, and cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

See also: Section 4.4

25. Buffer hot data

Extremely hot data and other high-change metadata should be buffered as much as possible and written to the drive as infrequently as possible.

See also: Section 4.4

System optimizations

26. PCI Express and SAS are faster than SATA

The two main host interfaces offered by manufacturers are SATA 3.0 (550 MB/s) and PCI Express 3.0 (1 GB/s per lane, using multiple lanes). Serial Attached SCSI (SAS) is also available for enterprise SSDs. In their latest versions, PCI Express and SAS are faster than SATA, but they are also more expensive.

See also: Section 2.1

27. Over-provisioning is useful for wear leveling and performance

A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller. Over-provisioning helps the wear leveling mechanisms to cope with the inherent limited lifespan of NAND-flash cells. For workloads in which writes are not so heavy, 10% to 15% of over-provisioning is enough. For workloads of sustained random writes, keeping up to 25% of over-provisioning will improve performance. The over-provisioning will act as a buffer of NAND-flash blocks, helping the garbage collection process to absorb peaks of writes.

See also: Section 5.2

Enable the TRIM command

Make sure your kernel and filesystem support the TRIM command. The TRIM command notifies the SSD controller when a block is deleted. The garbage collection process can then erase blocks in background during idle times, preparing the drive to face large writes workloads.

See also: Section 5.1

29. Align the partition

To ensure that logical writes are truly aligned to the physical memory, you must align the partition to the NAND-flash page size of the drive.

See also: Section 8.1


This summary concludes the “Coding for SSDs” article series. I hope that I was able to convey in an understandable manner what I have learned during my personal research over solid-state drives.

If after reading this series of articles you want to go more in-depth about SSDs, a good first step would be to read some of the publications and articles linked in the reference sections of Part 2 to 5.

Another great resource is the FAST conference (the USENIX Conference on File and Storage Technologies). A lot of excellent research is being presented there every year. I highly recommend their website, a good starting point being the videos and publications for FAST 2013.


This article was translated to Simplified Chinese by Xiong Duo.

Coding for SSDs – Part 5: Access Patterns and System Optimizations

2014 February 12

This is Part 5 over 6 of “Coding for SSDs”, covering Sections 7 and 8. For other parts and sections, you can refer to the Table to Contents. This is a series of articles that I wrote to share what I learned while documenting myself on SSDs, and on how to make code perform well on SSDs. If you’re in a rush, you can also go directly to Part 6, which is summarizing the content from all the other parts.

Now that I have covered most of the inner workings of solid-state drives in the previous sections, I can provide data that will help build an understanding of which access patterns should be used and why they are indeed better than others. In this part, I explain how writes should be done, how reads should be done, and why concurrent read and write operations are interfering. I also cover a few optimizations at the level of the filesystem which can improve performance.




read more…