System Design Interview Patterns-IV

SAKSHI CHHABRA
6 min read · May 13, 2022

In case you haven’t checked out my previous blogs on part I, part II, and part III of System Design Patterns, I would recommend checking them out here. Those blogs cover 6, 4, and 6 different system design patterns that are frequently used in system design interviews.

In the current blog, I am covering 5 more system design patterns that will come in handy for any system design interview.

Read Repair

Definition:

Repair stale data during the read operation: read data from multiple nodes, compare the responses, and figure out which nodes hold stale data. This process is known as Read Repair.

Once the nodes with stale data are found, the read repair operation pushes the newer version of the data to those nodes.

Context:

In a distributed environment, where data is replicated across multiple servers, some nodes can end up with stale data. This can happen when a node is down or there is a network partition, causing the node to miss write or update requests. How can we ensure that such nodes receive the latest version of the data once the node is back up or the network partition is resolved?

Solution:

Based on the quorum, the system reads data from multiple nodes. For example, with three replicas and a read quorum of two, the system reads the full data from one node and only a checksum (digest) of the data from the second node; sending a digest instead of the full data saves network bandwidth. If the checksum doesn’t match, some replicas don’t have the latest version of the data. In that case, the system reads the data from all the nodes, returns the latest version to the client, and initiates a Read Repair request in the background, which pushes the latest data onto the nodes holding an older version.
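
To make this concrete, here is a minimal Python sketch of the idea (the class and function names such as Replica and quorum_read are my own, not from any particular database, and versions are simplified to plain integers):

```python
import hashlib

class Replica:
    """Toy replica storing (value, version) per key."""
    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key, (None, 0))            # (value, version)

    def digest(self, key):
        # Checksum sent instead of the full data to save bandwidth.
        return hashlib.sha256(repr(self.read(key)).encode()).hexdigest()

    def repair(self, key, value, version):
        if version > self.read(key)[1]:                   # accept only newer data
            self.store[key] = (value, version)

def quorum_read(replicas, key):
    """Read full data from one replica and digests from the rest.
    On mismatch, fetch full data everywhere, return the newest value,
    and push it to stale replicas (read repair)."""
    primary, *others = replicas
    value, version = primary.read(key)
    expected = hashlib.sha256(repr((value, version)).encode()).hexdigest()

    if all(r.digest(key) == expected for r in others):
        return value                                      # replicas agree

    # Digest mismatch: read full data from every replica.
    responses = [(r, *r.read(key)) for r in replicas]
    _, latest_value, latest_version = max(responses, key=lambda t: t[2])

    # Read repair: push the latest version to nodes holding stale data.
    for r, _, v in responses:
        if v < latest_version:
            r.repair(key, latest_value, latest_version)
    return latest_value
```

In a real system the repair step would run asynchronously in the background rather than on the read path, but the comparison logic is the same.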

Example:

Cassandra and Dynamo use Read Repair to push the latest data onto nodes holding older versions.

Merkle Trees


Definition:

A replica can contain a lot of data, and it’s not feasible to naively split up the entire key range and calculate checksums for comparison; there would be too much data to transfer. Merkle trees are used to compare replicas of a key range efficiently.

Context:

Read repair resolves conflicts while serving read requests. But it might take a lot of time to resolve conflicts if a replica falls significantly behind the others, so it would be ideal to resolve conflicts in the background. To do so, we need to quickly compare two copies of a key range and figure out exactly which parts are different. How can we quickly compare two copies of a range of data residing on two different replicas and figure out exactly which parts differ?

Solution:

A Merkle tree is a binary tree of hashes, where each internal node is the hash of its two children and each leaf node is the hash of a portion of the original data.

Comparing two Merkle trees is pretty straightforward: start comparing both trees from the root. If the root hashes are equal, stop. Otherwise, recurse on the left and right children. This way, we find the differing ranges between replicas with minimal data exchange.
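
As an illustration, the following Python sketch (simplified, with illustrative names; real systems hash buckets of keys per sub-range rather than single values) builds a Merkle tree over a list of data buckets and walks two trees to find the ranges that need synchronization:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle_tree(buckets):
    """Build a Merkle tree over a list of data buckets (one per key sub-range).
    Leaves hash their bucket; internal nodes hash their two children."""
    nodes = [{"hash": h(b), "range": (i, i)} for i, b in enumerate(buckets)]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            left = nodes[i]
            right = nodes[i + 1] if i + 1 < len(nodes) else left  # duplicate last node if odd
            paired.append({
                "hash": h(left["hash"] + right["hash"]),
                "range": (left["range"][0], right["range"][1]),
                "left": left,
                "right": right,
            })
        nodes = paired
    return nodes[0]

def diff(a, b):
    """Return the bucket ranges where two Merkle trees disagree,
    descending only into subtrees whose hashes differ."""
    if a["hash"] == b["hash"]:
        return []                                   # identical subtree: nothing to sync
    if "left" not in a:                             # reached the leaves
        return [a["range"]]
    return diff(a["left"], b["left"]) + diff(a["right"], b["right"])

# Example: only bucket 2 differs, so only that range is reported for sync.
replica_a = build_merkle_tree([b"k0", b"k1", b"k2", b"k3"])
replica_b = build_merkle_tree([b"k0", b"k1", b"XX", b"k3"])
print(diff(replica_a, replica_b))                   # [(2, 2)]
```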

The advantage of Merkle trees is that each branch of the tree can be checked independently without having to download the entire data set. Hence, Merkle trees reduce the amount of data that needs to be transferred for synchronization and the number of disk accesses.

The disadvantage of Merkle trees is that the tree needs to be recalculated whenever a node joins or leaves, since many key ranges can change.

Example:

Dynamo uses Merkle trees for anti-entropy and to resolve conflicts in the background.

High-Water Mark

Definition:

The high-water mark index is the index of the last log entry on the leader that has been successfully replicated to a quorum of followers. The leader exposes data only up to the high-water mark index.

Context:

Each transaction is added to a write-ahead log (WAL), so that the leader can recover from temporary crashes and failures. In distributed systems, multiple copies of the data are stored for fault tolerance. For strong consistency, the leader replicates all the log entries to a quorum of followers. Now, if the leader fails, a new leader is elected and clients continue to work with the cluster as before.

But there are a few things that can go wrong:

  • The leader can fail before sending the log entries to its followers.
  • The leader succeeds in sending the log entries to a few followers, but fails before sending them to the majority of followers.

In these scenarios, some followers could be missing entries in their logs. So it becomes important for the newly elected leader and the followers to know which part of the log is safe to expose to clients.

Solution:

For each request, the leader first appends the entry to its WAL and then sends it to its followers. When a follower receives the entry, it appends it to its own WAL and sends an acknowledgement back to the leader. The leader keeps track of the highest index that has been successfully added to each follower’s WAL. The high-water mark index is the highest index that has been replicated to a quorum of followers, and the leader can propagate it to the followers as part of its heartbeat. The leader and followers ensure that clients can only see data up to the high-water mark index; log entries beyond the high-water mark aren’t visible to clients, because they aren’t fully replicated and might not survive a leader failure. This guarantees that there are no read inconsistencies even if the leader fails and another leader is elected.
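
A rough Python sketch of the leader-side bookkeeping might look like this (class and method names are illustrative, not taken from Kafka or any Raft implementation; acknowledgements are assumed to arrive via on_ack):

```python
class Leader:
    """Minimal sketch: a leader that tracks how far each follower's WAL has
    caught up and exposes entries only up to the high-water mark."""

    def __init__(self, follower_ids):
        self.wal = []                                   # leader's write-ahead log
        self.acked = {f: 0 for f in follower_ids}       # highest index acked per follower
        self.high_water_mark = 0

    def append(self, entry):
        self.wal.append(entry)
        return len(self.wal)                            # 1-based log index

    def on_ack(self, follower_id, index):
        """Called when a follower acknowledges it has appended up to `index`."""
        self.acked[follower_id] = max(self.acked[follower_id], index)
        self._advance_high_water_mark()

    def _advance_high_water_mark(self):
        # An index is committed once a quorum (majority including the leader) has it.
        quorum = (len(self.acked) + 1) // 2 + 1
        for index in range(len(self.wal), self.high_water_mark, -1):
            replicas_with_index = 1 + sum(1 for a in self.acked.values() if a >= index)
            if replicas_with_index >= quorum:
                self.high_water_mark = index
                break

    def read(self):
        # Clients only ever see entries up to the high-water mark.
        return self.wal[:self.high_water_mark]
```

For example, with two followers, an entry appended at index 1 becomes readable as soon as one follower acknowledges it, because two out of three replicas then hold it.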

Example:

Kafka: consumers can only read messages up to the high-water mark, i.e., messages that have been replicated to the in-sync replicas.

Phi Accrual Failure Detection

Definition:

Phi Accrual Failure Detection is an adaptive failure detection algorithm. “Accrual” means accumulating over time: the detector accumulates historical heartbeat information and uses it to make the threshold adaptable. Instead of simply telling whether the server is alive or not, it outputs a suspicion level about the server. A higher suspicion level means a higher chance that the server has crashed.

Context:

In a distributed system, accurate failure detection is hard, since we can never be 100% sure whether a system is actually down or just slow to respond due to network latency, an intermittent server issue, etc. Standard approaches like heartbeating output a boolean value revealing whether a node is down or not. Heartbeating uses a fixed timeout value, and if there isn’t any heartbeat within the timeout interval, the server is declared dead. This makes the timeout a very critical value. If we keep the timeout too short, server failures are detected quicker, but we get many false positives too. On the other hand, if we keep the timeout too long, false positives are reduced, but failure detection becomes slow. Are there any better approaches?

Solution:

When a node doesn’t respond, the suspicion level of the failure detector increases, and as the suspicion level grows the system gradually stops sending requests to that node. Concretely, the detector records the inter-arrival times of recent heartbeats and uses their distribution to compute a suspicion value (phi) from the time elapsed since the last heartbeat; the node is only treated as failed once phi crosses a configurable threshold. This makes failure detection more robust, because network fluctuations and intermittent server issues are taken into consideration before declaring a node dead.
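
Here is a simplified sketch of the idea, assuming heartbeat gaps roughly follow an exponential distribution (a simplification; real implementations such as Cassandra’s use more elaborate statistics over the heartbeat history):

```python
import math
import time
from collections import deque

class PhiAccrualFailureDetector:
    """A minimal sketch of a phi accrual failure detector."""

    def __init__(self, threshold=8.0, window_size=100):
        self.threshold = threshold                       # phi above this => suspect the node
        self.intervals = deque(maxlen=window_size)       # recent heartbeat gaps (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        """Record an incoming heartbeat from the monitored node."""
        now = now if now is not None else time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        """Current suspicion level; grows as the next heartbeat becomes overdue."""
        if self.last_heartbeat is None or not self.intervals:
            return 0.0
        now = now if now is not None else time.time()
        mean_interval = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Probability that a heartbeat arrives later than `elapsed`,
        # under an exponential model of inter-arrival times.
        p_later = math.exp(-elapsed / mean_interval)
        return -math.log10(max(p_later, 1e-300))

    def is_available(self, now=None):
        return self.phi(now) < self.threshold
```

The threshold plays the role the fixed timeout used to play, but the phi value it is compared against now adapts to the observed heartbeat behavior of the network.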

Example:

Cassandra: uses this algorithm to determine the state of the nodes in the cluster.

Follower Reads

Definition:

Read requests are directed away from the leader to the followers for better throughput and reduced latency.

Context:

In a leader-follower setup in a distributed system, the leader can get overloaded with requests. This results in increased latency even for less complex tasks. Is there an approach to reduce the burden on the leader in such situations?

Solution:

While write requests should go to the leader to maintain consistency across nodes, read-only requests can go to the followers. This is particularly useful when the leader is overloaded with requests.

It is possible that clients receive older values from followers. There will always be a replication lag between the leader and the followers, even if the distributed system implements a consensus algorithm, because after the leader commits a write it still needs another message to communicate it to the followers.

So, reading from followers is only useful in scenarios where slightly older values are acceptable.
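
A toy routing sketch of this split (all names are illustrative; real clients would also handle replication, failover, and read-your-own-writes concerns):

```python
import itertools

class Node:
    """A toy node with a local key-value store."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)

class Cluster:
    """Routing sketch: writes go to the leader, read-only requests are
    spread across followers (accepting possibly stale values)."""
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = itertools.cycle(followers)      # simple round-robin
        self.all_followers = followers

    def write(self, key, value):
        self.leader.write(key, value)
        # Replication to followers is asynchronous in real systems,
        # which is exactly why follower reads can return stale data.

    def read(self, key, allow_stale=True):
        # Latency-tolerant reads go to a follower; reads that must be
        # fresh fall back to the leader.
        node = next(self.followers) if allow_stale else self.leader
        return node.read(key)
```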

Example:

MongoDB (reads from secondaries via read preference), CockroachDB (follower reads)

Thank you for checking my blog!

In the upcoming blog, I will be covering more system design patterns.

