

Designing Data-Intensive Applications is a book that has been receiving a lot of praise recently. It helps the reader dive deep (way beyond the CAP theorem) into a world that has changed rapidly over the past years. It aims to give you insight into the core concepts and ideas of system design. After reading it, you will be able to compare systems and approaches and find something that suits your application. The book is also extremely helpful in always providing an extensive list of references at the end of each chapter, encouraging the reader to read more on the topics they find interesting.

Martin Kleppmann is the author of the book and a senior researcher in distributed systems at the University of Cambridge. He divides the book into three parts:

  1. Foundations of Data Systems
  2. Distributed Data
  3. Derived Data

In this first post, I will try to summarize the first part of the book. Please take this with a grain of salt: as a summary, it may sometimes not read naturally, but it will give you an overview so you can dive deeper (and read the book) yourself. You can find more information on where to buy it on its official site.

Designing Data Intensive Applications book cover

And since we are talking about distributed systems I’ll drop two other extremely important resources you can use while studying:

  1. High Scalability is an awesome site on… scalability! You should definitely add it to your RSS feed.
  2. System Design Primer is a repository that gives you an overview of distributed systems. It also provides you with answered questions commonly asked by big tech companies.

Reliable, Scalable and Maintainable Applications

Standard Building Blocks

  • Databases: store data so that they, or another application, can find it again later.
  • Caches: remember the result of an expensive operation, to speed up reads.
  • Search Indexes: allow users to search data by keyword or filter it in various ways.
  • Batch Processors: periodically crunch a large amount of accumulated data.

Tools don’t necessarily fit neatly in these use cases. For example, Apache Kafka is a message queue with database-like durability and Redis is an in-memory data store that can also be used as a message queue.

The 3 Concerns

  • Reliability: the system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware, software failures or human errors).
  • Scalability: means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we need ways of describing load and performance quantitatively.
  • Maintainability: over time, many different people will work on the system, and they should be able to work on it productively. Good abstractions can help reduce complexity and make the systems easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health and having effective ways of managing it.

Faults and Failures

A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from the spec, whereas a failure is when the system as a whole stops providing the required service to the user. So to reduce the number of failures, it is important to prevent faults from causing failures, i.e. to design fault-tolerance mechanisms.

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.

Describing Load

Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.

Describing Performance

In a batch processing system such as Hadoop, we usually care about throughput - the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what's usually more important is the service's response time - i.e., the time between a client sending a request and receiving a response.

A good metric for describing response times is tail latency, specifically the p999 (99.9th percentile), as Amazon found out.
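As a rough sketch of what that means in practice (the sample numbers below are made up), percentiles can be estimated by sorting the response-time samples and indexing at the desired rank:

```python
# Sketch: estimating tail latencies from response-time samples (in ms),
# using the simple nearest-rank method.

def percentile(samples, p):
    """Return the value below which roughly p% of the samples fall."""
    s = sorted(samples)
    k = min(len(s) - 1, int(len(s) * p / 100))
    return s[k]

response_times = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]
print(percentile(response_times, 50))  # → 14 (median: the typical request)
print(percentile(response_times, 99))  # → 900 (the slow outlier dominates)
```

Note how the median looks perfectly healthy while the high percentiles expose the outliers - which is exactly why tail latencies matter for user-facing services.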

Data Models and Query Languages

Relational Model

In the relational data model, data is organized into relations and tuples. A relation is an unordered set of tuples, and a tuple is a set of attribute values representing an entity. The database management system can organize the data internally in any way it wants, allowing optimization. In SQL, relations are tables and tuples are rows in those tables.

The Birth of NoSQL

There are several driving forces behind the adoption of NoSQL databases, including:

  • A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
  • A widespread preference for free and open-source software over commercial database products
  • Specialized query operations that are not well supported by the relational model
  • Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.

Polyglot persistence: when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application.

The Object-Relational Mismatch

If data is stored in relational tables, often an awkward translation layer is required between objects in the application code and the database model of tables, rows and columns (what is sometimes called an impedance mismatch). Sometimes it is better to represent the object as a simple JSON document, in which case it may be better to use a document-oriented database like MongoDB, RethinkDB, CouchDB or Espresso.

Many-to-One and Many-to-Many Relationships

To avoid data duplication and redundant copies in the database, we might want to store properties as identifiers instead of whole duplicated objects. That way, the identifier points to a single instance, reducing write overhead and inconsistencies. That is the key idea behind normalization in databases.

Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don't fit nicely into the document model. In relational databases, it's normal to refer to rows in other tables by ID, because joins are easy. In document databases, support for joins is often weak.

Relational Versus Document Databases

The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is close to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships. Document databases are normally schema-on-read (the structure of the data is implicit, and only interpreted when data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).

Say for example you are storing full names of your users in a field and want to split it into two different fields. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read.

if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}

On the other hand, in a “statically typed” database schema, you would typically perform a migration along the lines of:

ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = substring_index(name, ' ', 1);

Schema changes, however, are often slow and may require downtime, especially in MySQL.

A document is usually stored as a single continuous blob. If your application often needs to access the entire document, there is a performance advantage to this storage locality. If data is split across multiple tables, multiple index lookups are required to retrieve it all, which may take more time.

It is worth pointing out that the idea of grouping related data together for locality purposes is not limited to the document model. For example, Google's Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table's rows should be interleaved within a parent table. Oracle allows the same, using a feature called multi-table index cluster tables. The column-family concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality.

Query Languages for Data

SQL is a declarative language, that is, it specifies what needs to be done rather than how to do it. Most commonly used programming languages are imperative, where it is needed to describe exactly how to do a certain task.

In a declarative query language like SQL, you just specify the pattern of the data you want, and it is up to the database system's query optimizer to decide which indexes and join methods to use, and in which order to execute the various parts of the query. Imperative code, in contrast, is very hard to parallelize, making it a poor fit for query languages now that CPUs are getting faster by adding more cores rather than by running at higher clock speeds.
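A Python rendition of the book's sharks example makes the contrast concrete (the data here is invented):

```python
# Declaratively, in SQL, you would write:
#   SELECT * FROM animals WHERE family = 'Sharks';
# Imperatively, you must spell out exactly how to iterate:

animals = [
    {"name": "Great White", "family": "Sharks"},
    {"name": "Dolphin", "family": "Cetaceans"},
    {"name": "Hammerhead", "family": "Sharks"},
]

def get_sharks(records):
    sharks = []
    for animal in records:                  # iteration order is fixed by the code,
        if animal["family"] == "Sharks":    # leaving the engine no room to reorder,
            sharks.append(animal)           # index, or parallelize the work
    return sharks

print([a["name"] for a in get_sharks(animals)])  # → ['Great White', 'Hammerhead']
```

The SQL version states only the pattern of the desired result; the loop version hard-codes one particular execution strategy.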

MapReduce Querying

MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google. The map function is called once for every record and emits key-value pairs; the reduce function is then called once for each key, combining all the values emitted under that key.
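The model can be sketched in a few lines of single-machine Python - a word count, the classic example (this is the programming model only, not any framework's actual API):

```python
from collections import defaultdict

def map_fn(record):
    # called once per record; emits (key, value) pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # called once per key with every value emitted for that key
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            groups[key].append(value)       # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]  # reduce phase

print(map_reduce(["to be", "or not to be"], map_fn, reduce_fn))
# → [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real deployment, the map calls and the reduce calls are spread across many machines, with the shuffle step moving data between them.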

Graph-Like Data Models

A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). A powerful use of graphs is to provide a consistent way of storing completely different types of objects in a single data store. For example, Facebook maintains a single graph with many different types of vertices and edges: vertices represent people, locations, events, check-ins, and comments made by users; edges indicate which people are friends with each other, which check-in happened in which location, who commented on which post, and so on.

Two types of graph models will be discussed here: the property graph model (implemented by Neo4j, Titan and InfiniteGraph) and the triple-store model (implemented by Datomic and AllegroGraph).

Property Graphs

Building blocks of the property graph model


In the property graph model, each vertex consists of:

  • a unique id,
  • a set of outgoing edges,
  • a set of incoming edges,
  • a collection of properties (key-value pairs).

Each edge consists of:

  • a unique id,
  • the vertex at which the edge starts (the tail vertex),
  • the vertex at which the edge ends (the head vertex),
  • a label to describe the kind of relationship between two vertices,
  • a collection of properties (key-value pairs).
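These building blocks can be sketched in a few lines of Python (the vertices, edges, and the `outgoing` helper are illustrative, not any graph database's API):

```python
# Toy property graph: people and the places they live in (made-up data).
vertices = {
    1: {"type": "person", "name": "Lucy"},
    2: {"type": "location", "name": "Idaho"},
}
edges = [
    {"id": 10, "tail": 1, "head": 2, "label": "lives_in", "properties": {}},
]

def outgoing(vertex_id, label):
    """Follow outgoing edges with a given label from a vertex."""
    return [vertices[e["head"]] for e in edges
            if e["tail"] == vertex_id and e["label"] == label]

print([v["name"] for v in outgoing(1, "lives_in")])  # → ['Idaho']
```

Traversal is just repeated edge-following, which is why graph query languages like Cypher make multi-hop patterns so concise.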


Triple-Stores

In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.

The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:

  1. A value in a primitive datatype
  2. Another vertex in the graph
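A triple-store can be sketched as a list of tuples queried by pattern matching (the data and the `match` helper are made up for illustration; real triple-stores use query languages such as SPARQL):

```python
# Facts as (subject, predicate, object) triples.
triples = [
    ("jim", "likes", "bananas"),
    ("jim", "lives_in", "idaho"),
    ("idaho", "within", "usa"),
]

def match(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(match(subject="jim"))       # everything we know about jim
print(match(predicate="within"))  # all containment facts
```

Note how a triple's object that names another subject ("idaho") links two vertices, while "bananas" is just a primitive value.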

Storage and Retrieval

Data Structures that Power your Database

You could use a simple append-only log as your database; logs are really useful, but they have terrible performance if you want to look up a key (O(n)). In order to efficiently find the value for a particular key, we need a different data structure: an index.

An index is an additional structure derived from the primary data, and maintaining additional structures incurs overhead, especially on writes. That's the principal trade-off: no write strategy beats simply appending to a file.

Hash Indexes

Key-value stores are quite similar to the dictionary type, which is usually implemented as a hash map. One simple example is keeping an in-memory hash map that maps each key to a byte offset in our append-only log.
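A minimal sketch of this idea (Bitcask, the default storage engine in Riak, works roughly along these lines; the `key,value` line format here is invented, and assumes values contain no commas or newlines):

```python
import os
import tempfile

class LogKV:
    """Append-only log on disk plus an in-memory hash index of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}                      # key -> byte offset of latest record
        open(path, "ab").close()             # ensure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()       # remember where this record starts
            f.write(f"{key},{value}\n".encode())

    def get(self, key):
        offset = self.index[key]             # O(1) lookup instead of an O(n) scan
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = LogKV(os.path.join(tempfile.gettempdir(), "kv.log"))
db.set("name", "Bitcask")
db.set("name", "Bitcask-style store")        # the old record stays in the log
print(db.get("name"))                        # the index points at the latest one
```

Writes are pure appends (fast), and reads are one seek; the stale records left behind are exactly what compaction cleans up.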

But our files could get quite big, so a good solution is to break the log into segments: close a segment file when it reaches a certain size and make subsequent writes to a new segment file. We can then perform compaction on these segments - throwing away duplicate keys and keeping only the most recent value for each key. Each segment then needs its own in-memory hash table.

Some problems arise naturally: the hash tables must fit in memory, and range queries are not efficient - you would need to look up every single key individually in all the hash maps.

SSTables and LSM-Trees

The SSTable (short for Sorted String Table) is similar to the append-only log, except that the key-value pairs are sorted by key. This means that adding a new entry to the database is no longer a simple append.

With this structure, you no longer need to keep an index of all the keys in memory; since they are sorted, a sparse index with only a few of them is enough. That way, if you look for the key handiwork but it is not in your index, you can use the offsets of the keys handbag and handsome and know that your key, if present, lies between them. And since read requests need to scan over several key-value pairs anyway, it is possible to group those records into a block and compress it before writing it to disk.
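The handbag/handsome lookup is just a binary search over the sparse index - a sketch with invented offsets:

```python
import bisect

# Sparse in-memory index over one SSTable segment: (key, byte offset) pairs
# for only some of the keys, in sorted order.
sparse_index = [("handbag", 0), ("handsome", 4096), ("hangout", 8192)]
keys = [k for k, _ in sparse_index]

def scan_range_for(key):
    """Return (start, end) byte offsets of the block that may contain key."""
    i = max(bisect.bisect_right(keys, key) - 1, 0)
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else None
    return start, end

print(scan_range_for("handiwork"))  # → (0, 4096): between handbag and handsome
```

Only the bytes in that one block need to be read (and decompressed) to settle the lookup.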

Fine. But how do we write to SSTables?

  • When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree) - sometimes called a memtable.
  • When the memtable gets big enough - typically a few MiB - write it to disk as an SSTable file.
  • When a read comes in, try to find the key in the memtable, then look in the most recent SSTables.
  • Keep background workers merging and compacting segment files, discarding overwritten and deleted values.

Some problems also arise:

  • If the database crashes, we lose everything that is in the memtable. To solve this, keep a separate append-only log: every time a write comes in, append it to the log, just as in the previous approach. When the memtable is written out to disk as an SSTable, discard its log.
  • Looking up a key that was never inserted means checking the memtable and scanning every segment. To optimize this, storage engines often use additional Bloom filters - a probabilistic data structure designed to tell you, rapidly and memory-efficiently, whether an element is possibly present in a set.
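A minimal Bloom filter sketch (the sizes and the choice of SHA-256-derived hash positions are arbitrary illustration, not what any particular engine uses):

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits; lookups can say "definitely not present"
    or "possibly present" - false positives only, never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("handbag")
print(bf.might_contain("handbag"))    # True - present keys are always reported
print(bf.might_contain("handsome"))   # almost certainly False here
```

For an absent key, a storage engine can skip the segment entirely instead of searching it, at the cost of a small false-positive rate.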


B-Trees

B-Trees also keep key-value pairs sorted by key, but they break the database down into fixed-size blocks or pages, and read or write one page at a time. Each page can be identified using an address or location, which allows one page to refer to another - similar to a pointer, but on disk instead of in memory. We can use these page references to construct a tree of pages.


When you want to look up a key in the index you start at the page that is designated as the root of the B-tree. This page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.

The branching factor is the number of references to child pages in one page; in practice, it is typically several hundred. To insert a new key, you first find the page whose range contains that key and add it there. If there is not enough space, you split the page into two half-full pages and add the key to the appropriate half.

This way the B-tree remains balanced: a tree with n keys always has a depth of O(log n). Few levels are needed, because capacity grows exponentially with depth - a four-level B-tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
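The 256 TB figure is straightforward back-of-the-envelope arithmetic (using decimal terabytes):

```python
# Capacity of a four-level B-tree with 4 KB pages and branching factor 500.
branching_factor = 500
levels = 4
page_size = 4 * 1024                    # 4 KB in bytes

pages = branching_factor ** levels      # pages addressable through four levels
capacity_tb = pages * page_size / 1e12  # decimal terabytes
print(round(capacity_tb))               # → 256
```

The exponential growth is the point: going from three to four levels multiplies capacity by another factor of 500.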

Splitting pages to add a new key is a dangerous operation, as the database could crash in the middle of the process. To make it resilient, it is common to include a write-ahead log (WAL): an append-only file to which every modification is written before being applied to the tree itself.

Comparing B-Trees and LSM-Trees

LSM-trees are usually faster for writes, whereas B-trees are thought to be faster for reads. If write throughput is high in an LSM-tree and compaction is not configured carefully, compaction may not keep up with the rate of incoming writes, and the number of segments on disk grows unboundedly.

A key in a B-tree exists in exactly one place, whereas in a log-structured storage engine there may be multiple copies of the same key waiting to be merged.

Databases for Analytics

Databases started being increasingly used for data analytics, which has very different access patterns. Usually, an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates statistics (such as count, sum or average). We can differentiate the usual usage pattern from analytics pattern with the following naming:

  • Online Transaction Processing (OLTP)
  • Online Analytic Processing (OLAP)

OLTP systems are typically user-facing, and therefore they may see a huge volume of requests. Disk seek time is often the bottleneck here. OLAP systems handle a much lower volume of queries, but each query is typically very demanding. The bottleneck is often disk bandwidth, as it has to read/write a huge amount of data.

Companies, realizing this difference, started to run analytics on a separate database, called a data warehouse. The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries.

Data warehouse tables are often very wide: they can have over 100 columns and trillions of rows, so storing and querying them efficiently becomes a challenging problem. It is common, however, for queries to access only a few columns at a time. In most OLTP databases, storage is laid out in a row-oriented fashion. Hence a different approach: column-oriented storage - don't store all the values from one row together; store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse the columns it uses, which can save a lot of work.
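The two layouts can be sketched side by side over the same (made-up) sales table:

```python
# Row-oriented: each record keeps all of its columns together.
rows = [
    {"date": "2024-01-01", "product": "fruit", "quantity": 3},
    {"date": "2024-01-01", "product": "bread", "quantity": 1},
    {"date": "2024-01-02", "product": "fruit", "quantity": 5},
]

# Column-oriented: one list per column, values kept in the same row order
# so the n-th entry of every column belongs to the n-th row.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytic query like SELECT sum(quantity) only touches one column,
# instead of reading (and skipping over) every field of every row.
print(sum(columns["quantity"]))  # → 9
```

Keeping a column's values together also compresses extremely well, since values within one column tend to be similar.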

Encoding and Evolution

Features change regularly, and a feature change often requires a change to the stored data along with a corresponding change to application code. Because of gradual user adoption on the client side and rolling upgrades on the server side, old and new versions of the code and data formats may coexist, so we need to maintain both backward and forward compatibility.

Formats for Encoding Data

Programs usually work with data in at least two different representations: in-memory data structures and data encoded (or serialized) to be sent over the network. There are popular human-readable encoding formats such as JSON, XML, and CSV, and they are good enough for many purposes, but they can get quite verbose and space-consuming. As an alternative, there are binary variants (such as BSON, used by MongoDB), but some may argue that they provide only a small space reduction at the cost of human readability, which is a nice feature to have. Hence, new encoding formats arose.

Thrift and Protocol Buffers

Both Thrift and Protocol Buffers require a schema for any data that is encoded. They improve on the previous encodings by packing a numeric field tag instead of the full field name, and by using variable-length encoding for numbers.
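The variable-length integer ("varint") trick is simple enough to sketch - 7 bits of payload per byte, with the high bit marking "more bytes follow" (this mirrors the Protocol Buffers wire format, reimplemented here for illustration):

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # last byte: high bit clear
            return bytes(out)

def decode_varint(data):
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift   # each byte contributes 7 payload bits
        shift += 7
        if not byte & 0x80:
            return n

print(encode_varint(1).hex())             # → 01 (small numbers fit in one byte)
print(encode_varint(300).hex())           # → ac02
print(decode_varint(encode_varint(300)))  # → 300
```

Small numbers, which are by far the most common, cost a single byte instead of a fixed four or eight.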


Avro

Avro's binary encoding is in most cases even more compact than the previous encodings. The encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there's nothing in the encoded data that tells you it is a string. So, to parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. The binary data can therefore only be decoded correctly by code using the exact same schema as the code that wrote it.

The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same - they only need to be compatible. The Avro library resolves the differences by looking at the writer's schema and the reader's schema side by side and translating the data from the writer's schema into the reader's schema.

Modes of Dataflow

There are mainly three modes of dataflow:

  • Databases, where the process writing to the database encodes the data and the process reading from the database decodes it.
  • RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
  • Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient.

Asynchronous message passing through message broker has several advantages compared to direct RPC:

  • It can act as a buffer if the recipient is unavailable or overloaded
  • It can automatically redeliver messages to a process that has crashed and thus prevent messages from being lost.
  • It allows message broadcasting.
  • It avoids the sender needing to know specifics about the recipient (the sender just publishes messages and doesn’t care who consumes them)

Brokers are generally used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and/or consumers on the same topic.
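A toy in-process sketch of that flow (single-threaded and non-durable - real brokers like RabbitMQ or Kafka are networked, persistent services; every name here is invented):

```python
from collections import defaultdict, deque

class Broker:
    def __init__(self):
        self.queues = defaultdict(deque)      # topic -> buffered messages
        self.subscribers = defaultdict(list)  # topic -> consumer callbacks

    def publish(self, topic, message):
        if self.subscribers[topic]:
            for callback in self.subscribers[topic]:
                callback(message)             # broadcast to every subscriber
        else:
            self.queues[topic].append(message)  # buffer until someone subscribes

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)
        while self.queues[topic]:             # deliver anything buffered so far
            callback(self.queues[topic].popleft())

broker = Broker()
broker.publish("orders", "order-1")           # no subscriber yet: gets buffered
received = []
broker.subscribe("orders", received.append)   # drains the buffer on subscribe
broker.publish("orders", "order-2")           # delivered immediately
print(received)  # → ['order-1', 'order-2']
```

Note how the producer never references the consumer: it only names the topic, which is exactly the decoupling the list above describes.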


That’s it for the first part. If you are reading from the future you can probably find parts 2 and 3 in the blog. Until then!