Why replication? A story of single-leader / master-slave / active-passive replication in distributed systems.

Introduction

Replication in distributed systems means that each piece of data has more than one copy, and each copy is stored on a separate node.

There are a few reasons to adopt replication:

  • To achieve data redundancy and thus allow the system to keep working even if some nodes go down
  • To get the data geographically as close to end users as possible
  • To increase read throughput and reduce read latency by scaling the number of storage nodes horizontally

Scenario #1: Data redundancy
Scenario #2: Geo co-location
Scenario #3: Reducing read latency

At a glance, the implementation of replication looks fairly simple: just distribute the same data across a few nodes and harvest the advantages. That would be true if the data never changed once it was written.

Data in real systems tends to evolve over time, and the same piece of data is frequently updated concurrently. Updates pose significant challenges because replicas are spread across several nodes connected by a network.

Replication comes with several trade-offs and constraints. The most frequently asked questions concern the handling of failed replicas, the choice between synchronous and asynchronous replication, conflict resolution, and other topics. Since some of these problems are quite universal, I like to draw comparisons with concurrency in conventional programming languages.

Today, we’ll examine single-leader replication, the most widely used replication technique.

Leaders and followers

Typically, a distributed system cannot be viewed as a single entity and is frequently seen from the perspective of a cluster. A cluster is a group of nodes that do some work together. There are different roles in a cluster, and nodes are not equal in terms of their responsibilities.

Each node that stores a copy of the dataset is called a replica. Replicas work together to serve read and write requests across the cluster. The main question here is: how do we ensure that all data eventually reaches every replica once a write request has finished?

Every write to the database has to be processed by every replica to guarantee that all replicas end up with the same data. Leader-based replication, also known as active/passive or master-slave replication, became the most popular approach to solving this problem.

The single-leader approach handles write requests as follows:

One replica is chosen to act as the leader, sometimes referred to as the master node. Clients that want to write to the database must send their requests to the leader, and the leader node writes the new data to its local storage.

The other replicas are known as followers, sometimes called slaves or hot standbys. Each time the leader writes new data to its local storage, it also sends the change to all of its followers as part of a replication log. Each follower reads the leader’s change stream and updates its local copy of the database accordingly, applying all writes in the same order as they were processed on the leader.

Clients can query either the leader or any of the followers when they want to read data from the database. Followers, however, are read-only from the client’s point of view, since only the leader can accept writes.

Read/Write path in the single-leader replication
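
To make the read/write path concrete, here is a minimal sketch in Python. The Leader and Follower classes and their method names are illustrative assumptions, not the API of any particular database:

```python
# Minimal sketch of the single-leader read/write path.
# Class and method names are illustrative, not a real database API.

class Follower:
    def __init__(self):
        self.storage = {}              # local copy of the dataset

    def apply(self, entry):
        # Apply a replication-log entry in the order the leader sent it.
        key, value = entry
        self.storage[key] = value

    def read(self, key):
        return self.storage.get(key)


class Leader:
    def __init__(self, followers):
        self.storage = {}
        self.log = []                  # replication log: ordered list of writes
        self.followers = followers

    def write(self, key, value):
        # 1. The leader applies the write to its local storage first.
        self.storage[key] = value
        # 2. The change is appended to the replication log...
        entry = (key, value)
        self.log.append(entry)
        # 3. ...and streamed to every follower, which applies writes
        #    in exactly the same order as the leader did.
        for follower in self.followers:
            follower.apply(entry)

    def read(self, key):
        return self.storage.get(key)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("nickname", "new-handle")  # writes go to the leader only
print(followers[0].read("nickname"))    # reads may hit any replica: "new-handle"
```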

Leader-based replication was adopted in many systems: PostgreSQL, MySQL, MongoDB, Apache Kafka, and others.

Synchronous and asynchronous replication

The next question that arises is: “How do we transfer data changes from the leader to the followers?” There are two options available: we can send the updates either synchronously or asynchronously.

Consider a scenario where I have decided to change my GitHub nickname. Let’s observe what happens in a three-node master-slave cluster that uses synchronous replication:

Fully synchronous leader-based replication flow

As you can see from the picture above, fully synchronous replication between the leader and all followers is a time-consuming process. The leader must wait for an acknowledgement from every replica before returning a successful response to the client. Since every write blocks on the slowest follower, this approach is rarely used in real systems.

A more common approach is to use a mixture of synchronous and asynchronous replication:

Semi-synchronous leader-based replication flow

The second scenario takes less time to complete since the leader doesn’t wait for the second replica to apply changes.
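
Here is a sketch of how the acknowledgement rules differ between the fully synchronous, semi-synchronous, and fully asynchronous modes. The Node class and the helper functions are illustrative assumptions, not a real replication protocol:

```python
import threading

# Sketch of the three acknowledgement strategies.
# The Node class and helpers are illustrative, not a real protocol.

class Node:
    def __init__(self):
        self.storage = {}

    def apply(self, key, value):
        self.storage[key] = value

def send_in_background(follower, key, value):
    # Ship the change without blocking the client's write path.
    threading.Thread(target=follower.apply, args=(key, value), daemon=True).start()

def write_fully_sync(leader, followers, key, value):
    leader.apply(key, value)
    for f in followers:                    # block until EVERY follower confirms
        f.apply(key, value)
    return "ok"                            # latency is set by the slowest follower

def write_semi_sync(leader, sync_follower, async_followers, key, value):
    leader.apply(key, value)
    sync_follower.apply(key, value)        # wait for exactly one follower
    for f in async_followers:
        send_in_background(f, key, value)  # do not wait for the rest
    return "ok"                            # the data is on >= 2 nodes at this point

def write_async(leader, followers, key, value):
    leader.apply(key, value)
    for f in followers:
        send_in_background(f, key, value)  # fire and forget
    return "ok"                            # durable on the leader only
```

In the semi-synchronous variant, the client’s latency depends on a single follower, yet the data is guaranteed to live on at least two nodes by the time the client receives a response.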

Replication typically happens quickly, with most databases applying changes to followers in a fraction of a second. However, there is no guarantee of how long it might take: under some conditions, followers can fall behind the leader by several minutes or even more.

Synchronous replication’s key benefit is the guarantee that the follower always has an up-to-date copy of the data, identical to the leader’s. If the leader fails unexpectedly, we can be certain that the data is still available on the follower. The disadvantage is that a write cannot complete if the synchronous follower does not respond: the leader must block all writes and wait until the synchronous replica comes back online.

For that reason, not all followers can be synchronous: a single node failure would bring the whole system to a halt. In practice, one of the followers is synchronous while the others are asynchronous, and if the synchronous follower slows down or becomes unstable, one of the asynchronous followers is promoted to synchronous (a sketch of this promotion logic follows the diagram below). This guarantees that the most recent copy of the data is present on at least two nodes in the cluster.

Semi-synchronous leader-based cluster
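
One possible way to express the promotion logic is sketched below; the Cluster structure and the is_healthy check are assumptions made for illustration:

```python
from dataclasses import dataclass, field

# Sketch of promoting an asynchronous follower when the synchronous
# one fails. The Cluster layout and is_healthy are illustrative.

@dataclass
class Cluster:
    sync_follower: str
    async_followers: list = field(default_factory=list)

def is_healthy(node):
    # Placeholder health check; a real system would probe the node
    # (heartbeats, replication-lag thresholds, and so on).
    return node != "down"

def ensure_sync_follower(cluster):
    if is_healthy(cluster.sync_follower):
        return
    for candidate in list(cluster.async_followers):
        if is_healthy(candidate):
            # Promote the healthy async follower and demote the failed one,
            # so an up-to-date copy of the data stays on at least two nodes.
            cluster.async_followers.remove(candidate)
            cluster.async_followers.append(cluster.sync_follower)
            cluster.sync_follower = candidate
            return
    raise RuntimeError("no healthy follower available for promotion")

cluster = Cluster(sync_follower="down", async_followers=["replica-2", "replica-3"])
ensure_sync_follower(cluster)
print(cluster.sync_follower)  # -> replica-2
```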

In some cases, leader-based replication is configured to be fully asynchronous. This comes with a serious caveat: if the leader fails, all writes that were not yet replicated to the followers are lost, so writes are not durable. The benefit is that the leader can keep accepting writes even if all of its followers have fallen behind or gone offline. That can be a fair trade-off if your cluster has many replicas or is distributed across the world.

Scaling the cluster

Clusters evolve with time, and sometimes you may need to increase the number of replicas to serve more read requests or to move the data closer to users in a new region.

It looks simple at first glance: just copy all the data from the leader to a new replica node. But keep in mind that clients keep writing new data to the cluster, and you can’t block writes every time you want to add a new node.

To avoid downtime, databases usually take snapshots that reflect a consistent state of the leader at a certain point in time. When a new replica is added to the cluster, it first loads the snapshot, which brings it up to date with the leader as of some historical point. After that, the replica asks the leader for all updates that happened after the snapshot was taken. Once this backlog has been processed, the replica is in sync with the leader and starts serving read requests.
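
A rough sketch of that bootstrap sequence might look as follows; the snapshot and log structures here are illustrative, not any real storage engine’s format:

```python
# Sketch of bootstrapping a new replica without blocking writes.
# The snapshot and log structures are illustrative.

class Leader:
    def __init__(self):
        self.storage = {}
        self.log = []                      # ordered list of (key, value) writes

    def write(self, key, value):
        self.storage[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # A consistent copy of the data plus the log position it reflects.
        return dict(self.storage), len(self.log)

    def updates_since(self, position):
        return self.log[position:]


class NewReplica:
    def __init__(self):
        self.storage = {}
        self.position = 0

    def bootstrap(self, leader):
        # 1. Load the snapshot: up to date as of some historical point.
        self.storage, self.position = leader.snapshot()
        # 2. Request the backlog: everything written after the snapshot.
        for key, value in leader.updates_since(self.position):
            self.storage[key] = value
            self.position += 1
        # 3. In sync with the leader: ready to serve read requests.


leader = Leader()
leader.write("nickname", "old-handle")
replica = NewReplica()
replica.bootstrap(leader)      # the leader keeps accepting writes meanwhile
print(replica.storage)         # -> {'nickname': 'old-handle'}
```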

Conclusions

Distributed systems use various approaches to guarantee durability, reliability, efficient scaling, and elasticity. Replication is one of them: it makes your system more fault-tolerant and improves its horizontal scalability.

There are many types of replication, but the most common is single-leader or master-slave replication. In this type of replication, all writes are processed exclusively by the leader node, while reads can be processed both by the leader and replica nodes.

Single-leader replication may be synchronous, asynchronous, or semi-synchronous. Each mode has its pros and cons, and semi-synchronous replication seems to be the most widespread in modern systems.