Replication in MongoDB: Simultaneity of the Non-Simultaneous (Part 1)

Database replication, eventual consistency and anomalies

Christian Del Monte
7 min read · Jul 29, 2020

Reducing latency, increasing read throughput and guaranteeing availability: these are priority objectives for anyone wanting to improve the resilience and performance of a computer system distributed over the internet. One way to achieve these goals is to maintain copies of the same data on multiple machines, independent of each other and geographically well distributed. This keeps data closer to the end users and allows a larger number of queries to be served in a given time window. Furthermore, it improves the ability of a computer system to continue working despite the failure of one of its parts. This strategy is commonly called replication.

M. C. Escher, Day and Night, February 1938, woodcut, printed in grey and black inks. Escher Collection, Gemeentemuseum Den Haag, The Hague, the Netherlands. © The M. C. Escher Company, the Netherlands. All rights reserved.

In this series of articles I’ll present how replication is implemented in MongoDB, focusing on best practices to avoid common pitfalls and to mitigate relevant issues. But before we start talking about MongoDB and its replication mechanisms, let’s use this first article to introduce replication as such, illustrating how it works and which issues need to be taken into account.

Types of Replication

There are three main types of replication: leader-based, multi-leader and leaderless replication.

In the first one, only one replica, called the leader (also known as master or primary), serves both write and read requests, while the other replicas, the followers (also known as slaves, secondaries or stand-by replicas), deal exclusively with read queries. To propagate data, in addition to performing write operations the leader also appends every data change to a replication log. This log is consumed by the follower replicas, which update their local copy of the data accordingly, thus keeping the whole system synchronized. This model, also known as active/passive replication, is widely adopted by many databases: MySQL[1], PostgreSQL[2] (since version 9.0), SQL Server Always On[3], Oracle Data Guard[4] and Amazon DynamoDB[5] are among them. More relevant for us, this is the mechanism used in MongoDB as well.
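The log-based propagation described above can be sketched in a few lines of Python. This is a deliberately simplified toy model (in-memory, single process, illustrative names), not MongoDB's actual oplog machinery:

```python
# Toy model of leader-based replication via a replication log.
# Illustrative only: real systems ship the log over the network.

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered sequence of data changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))  # every write is appended to the log

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # position in the leader's log consumed so far

    def pull(self, leader):
        # consume new log entries and apply them to the local copy
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)

leader = Leader()
follower = Follower()
leader.write("x", 1)
follower.pull(leader)  # the follower catches up with the leader
```

Because every change flows through one ordered log, all followers apply writes in the same order as the leader, which is what keeps the system synchronized.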

The second form of replication is an extension of the leader-follower model. The mechanism used to propagate data changes is the same; the difference is the presence of multiple leaders, i.e. multiple nodes in the database cluster able to accept write requests at the same time.

Leaderless replication, the last one, takes a totally different approach: all replicas accept both write and read requests. This makes the system highly available, but at the same time reduces the possibility of achieving strong consistency. Eventual consistency is the trade-off you have to live with. Amazon’s Dynamo[6], Cassandra[7], Riak[8] and Voldemort[9] are some of the databases using this solution.

Leader based Replication

As mentioned before, MongoDB adopts leader-based replication, so let’s now put aside the other two replication methods and focus on the way the leader and its followers communicate with each other.

A central role in replication is played by the way data changes are propagated: synchronously or asynchronously. Bottlenecks, data loss and localized inconsistencies in the distribution of data changes all depend on it.

Generally speaking, replication is defined as synchronous if the leader node waits for an acknowledgment from a so-called synchronous follower before sending a success response and continuing to propagate the new writes to its remaining followers. This ensures that, at worst, a full replica of the leader exists at all times. By contrast, asynchronous replication doesn’t give us this guarantee: the leader doesn’t have a synchronous follower and keeps processing write requests regardless of the followers, which may have been left behind.

If, at first glance, synchronous replication seems to be the best fit to provide availability and durability, problems arise when the synchronous follower isn’t available and the leader must block all write operations until it is back. That wouldn’t happen with fully asynchronous replication. On the other hand, in the asynchronous form data changes may get lost if the leader crashes before its followers have replicated the last writes, a disaster scenario for software solutions where data loss is unacceptable.
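The trade-off can be made concrete with a toy sketch: a synchronous write stalls when its designated follower is unreachable, while an asynchronous write returns success immediately, at the cost of possibly losing un-replicated changes. All names and the `TimeoutError` used to model unavailability are illustrative assumptions:

```python
# Toy contrast between synchronous and asynchronous replication.
# Illustrative only: real replication happens over the network.

class Follower:
    def __init__(self, available=True):
        self.available = available
        self.data = {}

    def replicate(self, key, value):
        if not self.available:
            raise TimeoutError("synchronous follower unreachable")
        self.data[key] = value

def write_sync(followers, key, value):
    # the leader confirms success only after the synchronous
    # follower (followers[0]) has acknowledged the change
    followers[0].replicate(key, value)
    return "ok"

def write_async(followers, key, value):
    # the leader confirms right away and replicates on a best-effort
    # basis; a leader crash at this point could lose the write
    for f in followers:
        if f.available:
            f.replicate(key, value)
    return "ok"

sync_follower = Follower(available=False)
try:
    write_sync([sync_follower], "x", 1)  # stalls: writes are blocked
except TimeoutError as e:
    print(e)
```

Here the synchronous path fails loudly while the follower is down, which is exactly the blocking behavior described above; the asynchronous path would have answered "ok" regardless.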

Regarding node outages, it is common to distinguish between follower and leader failure scenarios. If a follower outage happens due to a crash or a transient network issue, the recovery procedure is pretty straightforward: as soon as the follower is back, it connects to the leader and retrieves all missing data, starting from the last transaction present in its log. This process is called catch-up recovery. If the leader node crashes or is unreachable, a new one is elected, and followers and clients need to be reconfigured to consume from it and to send write requests to it, respectively. This process is commonly referred to as failover. While the first process does not present particular problems, the second one deserves more attention, not least because of the different types of participants involved: leader, followers and clients.
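The core of failover is choosing which follower to promote. One common heuristic, sketched below, is to pick the replica that is least behind, i.e. the one with the most replicated log entries, to minimize data loss. This is a toy illustration of the idea, not MongoDB's actual election protocol:

```python
# Toy failover: on leader failure, promote the most up-to-date
# follower and re-point clients at it. Purely illustrative.

class Replica:
    def __init__(self, name, log):
        self.name = name
        self.log = log  # replicated log entries

def elect_new_leader(followers):
    # heuristic: the replica with the longest replicated log
    # has lost the fewest writes and becomes the new leader
    return max(followers, key=lambda f: len(f.log))

followers = [Replica("A", [1, 2]), Replica("B", [1, 2, 3])]
new_leader = elect_new_leader(followers)
print(new_leader.name)  # "B": the follower that was least behind
```

After the election, clients must route their writes to the new leader and the remaining followers must consume its log, which is where the coordination issues mentioned above come from.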

Eventual Consistency and Replication Lag

So, let’s return to the temporal aspect of replication. Immediately after a write operation, whether replication is synchronous or asynchronous, not all replicas are in sync with the leader: it takes some time to restore the consistency of the system as a whole. This state is commonly referred to as eventual consistency[10] and does not present particular problems as long as the delay does not exceed a given threshold. Beyond a certain point, however, it can become problematic. That delay is normally referred to as replication lag: given a follower node, replication lag is the delay between the time an operation occurs on the leader and the time the same operation gets applied on the follower.

Replication lag causes some consistency anomalies when reading data. The most common are the read-after-write, the moving-backward-in-time and the causality-violation anomalies. The read-after-write anomaly (see fig. 1) occurs when a user executes a write operation on the leader and immediately performs a read to fetch the data written just before. If this read request is routed to a not-yet-synchronized follower, the response will return either an error or stale data. The moving-backward-in-time anomaly (see fig. 2) happens when a client doing sequential reads is served by different, partially unsynchronized followers, receiving different versions of the same data and thus getting the impression of jumping randomly backward and forward in time. The causality-violation anomaly takes place when reads of sequential writes don’t observe the order in which the writes happened. Although it sounds strange, this can happen for example in partitioned databases, where asynchronous replication combined with the presence of a leader in each partition makes it difficult to establish a global order of writes.

Fig. 1: read-after-write anomaly
Fig. 2: moving-backward-in-time anomaly
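The read-after-write anomaly is easy to reproduce in a toy model where the follower has not yet consumed the leader's log. The dictionaries and keys below are illustrative stand-ins for replicas and documents:

```python
# Demonstrating the read-after-write anomaly with a lagging follower.
# Toy model: replicas are plain dicts, replication has not caught up.

leader = {"profile": "new bio"}    # the user's write is applied here
follower = {"profile": "old bio"}  # replication lag: still behind

def read(replica, key):
    return replica.get(key)

# the user who just wrote "new bio" is routed to the stale follower
value = read(follower, "profile")
print(value)  # "old bio": the user's own write seems to have vanished
```

Routing this particular read to the leader (or to a sufficiently fresh follower) is exactly what the read-my-writes guarantee discussed below formalizes.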

To limit the extent of these anomalies, it is possible to apply replication strategies that provide stronger guarantees than eventual consistency alone. In his article “Replicated Data Consistency Explained Through Baseball”[11], Doug Terry describes these additional guarantees as follows: “By requesting a consistent prefix, a reader is guaranteed to observe an ordered sequence of writes starting with the first write to a data object. (…) With monotonic reads, a client can read arbitrarily stale data, as with eventual consistency, but is guaranteed to observe a data store that is increasingly up-to-date over time. (…) Read My Writes is a property that also applies to a sequence of operations performed by a single client. It guarantees that the effects of all writes that were performed by the client are visible to the client’s subsequent reads.”

Following Doug Terry’s terminology, I will refer to these guarantees as the consistent prefix reads, the monotonic reads and the read-my-writes guarantees, respectively. Martin Kleppmann describes in his book “Designing Data-Intensive Applications”[12] different ways to implement these replication strategies. I will show which of them are suitable for MongoDB.
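One implementation idea along the lines Kleppmann discusses is client-side tracking: for monotonic reads, the client remembers the highest log position it has observed and refuses to read from replicas that are further behind. The sketch below is a toy version of that idea; the `applied` field standing in for a replica's log position is an assumption of this model:

```python
# Toy monotonic-reads guarantee: the client tracks the freshest
# log position it has seen and never accepts a staler replica.

class Replica:
    def __init__(self, applied, data):
        self.applied = applied  # log position this replica has applied
        self.data = data

class Client:
    def __init__(self):
        self.last_seen = 0  # freshest log position observed so far

    def read(self, replicas, key):
        # only consider replicas at least as fresh as our last read
        fresh = [r for r in replicas if r.applied >= self.last_seen]
        replica = max(fresh, key=lambda r: r.applied)
        self.last_seen = replica.applied
        return replica.data[key]

r1 = Replica(applied=5, data={"x": "new"})
r2 = Replica(applied=3, data={"x": "old"})
client = Client()
first = client.read([r1, r2], "x")   # served by r1, last_seen becomes 5
second = client.read([r2, r1], "x")  # r2 is now too stale to be used
```

A simpler variant of the same family of techniques is to pin each client to a single replica, which yields monotonic reads without any position bookkeeping, at the cost of less flexible load balancing.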

Conclusions

Having introduced the topic of database replication in general and briefly illustrated the problems and open questions that derive from it, let’s see in the next article which answers and solutions MongoDB offers to deal with them.

[1] “MySQL 8.0 Reference Manual :: 17.1.1 Binary Log … — MySQL.” https://dev.mysql.com/doc/refman/8.0/en/binlog-replication-configuration-overview.html. Accessed 24 Jul. 2020.

[2] “Streaming Replication — PostgreSQL wiki.” https://wiki.postgresql.org/wiki/Streaming_Replication. Accessed 24 Jul. 2020.

[3] “What is an Always On availability group? — SQL Server Always ….” 29 Apr. 2020, https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server. Accessed 24 Jul. 2020.

[4] “Introduction to Oracle Data Guard — Oracle Help Center.” https://docs.oracle.com/cd/B19306_01/server.102/b14239/concepts.htm. Accessed 24 Jul. 2020.

[5] “Amazon DynamoDB — Wikipedia.” https://en.wikipedia.org/wiki/Amazon_DynamoDB. Accessed 24 Jul. 2020.

[6] “Amazon’s Dynamo — All Things Distributed.” 2 Oct. 2007, https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html. Accessed 24 Jul. 2020.

[7] “Cassandra — A Decentralized Structured Storage System.” http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF. Accessed 24 Jul. 2020.

[8] “Replication — Riak Docs.” https://docs.riak.com/riak/kv/latest/learn/concepts/replication/index.html. Accessed 24 Jul. 2020.

[9] “Serving Large-scale Batch Computed Data with Project ….” https://www.usenix.org/legacy/events/fast12/tech/full_papers/Sumbaly.pdf. Accessed 24 Jul. 2020.

[10] “Eventually Consistent — Communications of the ACM.” https://m-cacm.acm.org/magazines/2009/1/15666-eventually-consistent/fulltext. Accessed 28 Jul. 2020.

[11] “Replicated Data Consistency Explained Through … — Microsoft.” https://www.microsoft.com/en-us/research/wp-content/uploads/2011/10/ConsistencyAndBaseballReport.pdf. Accessed 28 Jul. 2020.

[12] Kleppmann, Martin, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O’Reilly, 2017.


Christian Del Monte

Software architect and engineer with over 20 years of experience. Interested in data lakes, devops and highly available event-driven architectures.