Most of us developers have had experience with web or native applications that run on a single computer, but things are a lot different when you need to build a distributed system to synchronize dozens, sometimes hundreds of computers to work together.
I recently received an email from someone asking me how to get started with infrastructure design, and I thought that I would share what I wrote him in a blog post if that can help more people who want to get started in that as well.
To receive a notification email every time a new article is posted on Code Capsule, you can subscribe to the newsletter by filling up the form at the top right corner of the blog.
A basic example: a distributed web crawler
For multiple computers to work together, you need some sort of synchronization mechanisms. The most basic ones are databases and queues. Part of your computers are producers or masters, and another part are consumers or workers. The producers write data in a database, or enqueue jobs in a queue, and the consumers read the database or queue. The database or queue system runs on a single computer, with some locking, which guarantees that the workers don’t pick the same work or data to process.
Let’s take an example. Imagine you want to implement a web crawler that downloads web pages along with their images. One possible design for such a system will require the following components:
- Queue: the queue contains the URLs to be crawled. Processes can add URLs to the queue, and workers can pick up URLs to download from the queue.
- Crawlers: the crawlers pick URLs from the queue, either web pages or images, and download them. If a URL is a webpage, the crawlers also look for links in the page, and push all those links to the queue for other crawlers to pick them up. The crawlers are at the same time the producers and the consumers.
- File storage: The file storage stores the web pages and images in an efficient manner.
- Metadata: a database, either MySQL-like, Redis-like, or any other key-value store, will keep track of which URL has been downloaded already, and if so where it is stored locally.
The queue and the crawlers are their own sub-systems, they communicate with external web servers on the internet, with the metadata database, and with the file storage system. The file storage and metadata database are also their own sub-systems.
Figure 1 below shows how we can put all the sub-systems together to have a basic distributed web crawler. Here is how it works:
1. A crawler gets a URL from the queue.
2. The crawler checks in the database if the URL was already downloaded. If so, just drop it.
3. The crawler enqueues the URLs of all links and images in the page.
4. If the URL was not downloaded recently, get the latest version from the web server.
5. The crawler saves the file to the File Storage system: it talks to a reserse proxy that’s taking incoming requests and dispatching them to storage nodes.
6. The File Storage distributes load and replicates data across multiple servers.
7. The File Storage update the metadata database so we know which local file is storing which URL.
Figure 1: Architecture of a basic distributed web crawler
The advantage of a design like the one above is that you can scale up independently each sub-system. For example, if you need to crawl stuff faster, just add more crawlers. Maybe at some point you’ll have too many crawlers and you’ll need to split the queue into multiple queues. Or maybe you realize that you have to store more images than anticipated, so just add a few more storage nodes to your file storage system. If the metadata is becoming too much of a centralized point of contention, turn it into a distributed storage, use something like Cassandra or Riak for that. You get the idea.
And what I have presented above is just one way to build a simple crawler. There is no right or wrong way, only what works and what doesn’t work, considering the business requirements.
Talk to people who are doing it
The one unique way to truly learn how to build a distributed system is to maintain or build one, or work with someone who has built something big before. But obviously, if the company you’re currently working at does not have the scale or need for such a thing, then my advice is pretty useless…
Go to meetup.com and find groups in your geographic area that talk about using NoSQL data storage systems, Big Data systems, etc. In those groups, identify the people who are working on large-scale systems and ask them questions about the problems they have and how they solve them. This is by far the most valuable thing you can do.
There are a few basic concepts and tools that you need to know about, some sort of alphabet of distributed systems that you can later on pick from and combine to build systems:
- Concepts of distributed systems: read a bit about the basic concepts in the field of Distributed Systems, such as consensus algorithms, consistent hashing, consistency, availability and partition tolerance.
- RDBMs: relational database management systems, such as MySQL or PostgreSQL. RDMBs are one of the most significant invention of humankind from the last few decades. They’re like Excel spreadsheets on steroid. If you’re reading this article I’m assuming you’re a programmer and you’ve already worked with relational databases. If not, go read about MySQL or PostgreSQL right away! A good resource for that is the web site http://use-the-index-luke.com/
- Queues: queues are the simplest way to distribute work among a cluster of computers. There are some specific projects tackling the problem, such as RabbitMQ or ActiveMQ, and sometimes people just use a table in a good old database to implement a queue. Whatever works!
- Load balancers: if queues are the basic mechanism for a cluster of computer to pull work from a central location, load balancers are the basic tool to push work to a cluster of computer. Take a look at Nginx and HAProxy.
- Caches: sometimes accessing data from disk or a database is too slow, and you want to cache things in the RAM. Look at projects such as Memcached and Redis.
- Hadoop/HDFS: Hadoop is a very spread distributed computing and distributed storage system. Knowing the basics of it is important. It is based on the MapReduce system developed at Google, and is documented in the MapReduce paper.
- Distributed key-value stores: storing data on a single computer is easy. But what happens when a single computer is no longer enough to store all the data? You have to split your storage into two computers or more, and therefore you need mechanisms to distribute the load, replicate data, etc. Some interesting projects doing that you can look at are Cassandra and Riak.
Read papers and watch videos
There is a ton of content online about large architectures and distributed systems. Read as much as you can. Sometimes the content can be very academic and full of math: if you don’t understand something, no big deal, put it aside, read about something else, and come back to it 2-3 weeks later and read again. Repeat until you understand, and as long as you keep coming at it without forcing it, you will understand eventually. Some references:
Real-world systems and practical resources
Build something on your own
There are plenty of academic courses available online, but nothing replaces actually building something. It is always more interesting to apply the theory to solving real problems, because even though it’s good to know the theory on how to make perfect systems, except for life-critical applications it’s almost never necessary to build perfect systems.
Also, you’ll learn more if you stay away from generic systems and instead focus on domain-specific systems. The more you know about the domain of the problem to solve, the more you are able to bend requirements to produce systems that are maybe not perfect, but that are simpler, and which deliver correct results within an acceptable confidence interval. For example for storage systems, most business requirements don’t need to have perfect synchronization of data across replica servers, and in most cases, business requirements are loose enough that you can get away with 1-2% or erroneous data, and sometimes even more. Academic classes online will only teach you about how to build systems that are perfect, but that are impractical to work with.
It’s easy to bring up a dozen of servers on DigitalOcean or Amazon Web Services. At the time I’m writing this article, the smallest instance on DigitalOcean is $0.17 per day. Yes, 17 cents per day for a server. So you can bring up a cluster of 15 servers for a weekend to play with, and that will cost you only $5.
Build whatever random thing you want to learn from, use queuing systems, NoSQL systems, caching systems, etc. Make it process lots of data, and learn from your mistakes. For example, things that come to my mind:
- Build a system that crawls photos from a bunch of websites like the one I described above, and then have another system to create thumbnails for those images. Think about the implications of adding new thumbnail sizes and having to reprocess all images for that, having to re-crawl or having to keep the data up-to-date, having to serve the thumbnails to customers, etc.
- Build a system that gathers metrics from various servers on the network. Metrics such as CPU activity, RAM usage, disk utilization, or any other random business-related metrics. Try using TCP and UDP, try using load balancers, etc.
- Build a system that shards and replicate data across multiple computers. For example, you’re complete dataset is A, B, and C and it’s split across three servers: A1, B1, and C1. Then, to deal with server failure you want to replicate the data, and have exact copies of those servers in A2, B2, C2 and A3, B3, C3. Think about the failure scenarios, how you would replicate data, how you would keep the copies synced, etc.?
Look at systems and web applications around you, and try to come up with simplified versions of them:
- How would you store the map tiles for Google Maps?
- How would you store the emails for Gmail?
- How would you process images for Instagram?
- How would you store the shopping cart for Amazon?
- How would you connect drivers and users for Uber?
Once you’ve build such systems, you have to think about what solutions you need to deploy new versions of your systems to production, how to gather metrics about the inner-workings and health of your systems, what type of monitoring and alerting you need, how you can run capacity tests so you can plan enough servers to survive request peaks and DDoS, etc. But those are totally different stories!