System Design Interview Patterns-II
In case you haven’t checked out System Design Patterns-I, I would recommend reading it first. That blog covers 6 different system design patterns that are frequently used in system design interviews.
In this blog, I am covering 4 more system design patterns that will come in handy for any system design interview.
Calculate a checksum and store it along with the data. The checksum is computed using a cryptographic hash function such as SHA-1, SHA-256, or SHA-512. The hash function takes the input data and produces a fixed-length alphanumeric string, which is known as the checksum.
In a distributed system, when data moves between nodes, there is a possibility that it gets corrupted before reaching the destination node. The corruption could occur due to faults in the network, storage, or software. How can data integrity be ensured, so that the client gets an error when corrupted data is received?
While transferring data across components, the system calculates the checksum of the data and sends it along with the data. When the client receives the data, it recomputes the hash of the received data and compares it with the checksum. If they don’t match, the client requests another replica of the data from the sender.
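This verification step can be sketched in a few lines of Python using the standard `hashlib` module. The function names here (`with_checksum`, `verify`) are illustrative, not from any particular system:

```python
import hashlib

def with_checksum(data: bytes) -> tuple[bytes, str]:
    """Attach a SHA-256 checksum (hex string) to outgoing data."""
    return data, hashlib.sha256(data).hexdigest()

def verify(data: bytes, checksum: str) -> bool:
    """Recompute the hash of the received data and compare it
    with the checksum that travelled alongside it."""
    return hashlib.sha256(data).hexdigest() == checksum

payload, digest = with_checksum(b"user-profile-record")
print(verify(payload, digest))       # True: data arrived intact
print(verify(b"corrupted", digest))  # False: client should request another replica
```

If `verify` returns `False`, a real client would retry against a different replica rather than surface the corrupted bytes.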
The CAP Theorem says that it is impossible for a distributed system to provide all three of the following properties at the same time:
- Availability (A): Availability means that every request received by a live node results in a response. Even if there are severe network failures, every request must complete successfully. In other words, the system remains accessible even if one of the nodes goes down.
- Consistency (C): Consistency means that users can read from or write to any node in the system and receive the same data from every node. This is equivalent to having a single up-to-date copy of the data.
- Partition tolerance (P): A partition is a network failure between two nodes in the system, i.e. the nodes are up but unable to communicate. A partition-tolerant system continues to operate even when there are partitions. This applies only when the network failure does not bring down the entire system. The data is replicated across multiple nodes, and the system keeps working through temporary failures.
In a distributed system, diverse kinds of failures occur, e.g. network or hardware failures that make part of the system inaccessible. How can a distributed system make the best use of all the available resources in the face of such failures?
Any distributed system has to pick two of the three properties: A, C, and P.
Giving up P is not really an option: any real distributed system will experience a network partition at some point, and we want the system to continue operating despite it. Hence, in case of a network partition, a system has to choose between availability and consistency and give up the other.
The CAP Theorem states: in the event of a network partition, a system can provide availability or consistency, but not both.
BigTable falls into the CP category, at the cost of high availability.
Dynamo is a high-availability system that falls into the AP category, at the cost of strong consistency. This design was based on the observation that high availability is directly related to the number of customers served.
The PACELC theorem states that in a system that replicates data:
- in case of a network partition (P), the system trades off between availability (A) and consistency (C);
- else (E), the system trades off between latency (L) and consistency (C).
A network partition cannot be avoided in a distributed system; therefore, as per the CAP Theorem, the system has to choose between availability and consistency.
Databases supporting the ACID model choose consistency, while those supporting BASE choose availability. Check here to understand the ACID and BASE models.
What CAP leaves out is this: what happens when there is no network partition?
The first part of the theorem (PAC) is the CAP theorem, while the second part (ELC) covers the no-partition case. During replication, we assume the system maintains availability the entire time.
Hence, in case of a network partition, the CAP theorem reigns; otherwise, the system has to choose between latency and consistency.
Cassandra and Dynamo: they choose availability in case of a partition; otherwise, they choose lower latency.
BigTable and HBase: they choose consistency in case of a partition; otherwise, they still choose consistency (over latency).
When nodes are down, the system keeps hints (or notes) of all the requests they have missed. Once a failed node recovers, the missed requests are forwarded to it based on the stored hints.
A distributed system can still serve write requests even when some nodes are down. Suppose there are 5 followers and the system writes with a quorum consistency level. Even if two nodes die, the system can still serve write requests successfully, because the three live nodes satisfy the quorum. Now, how do we update the node that was down once it comes back online?
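The quorum arithmetic behind that example is simple: a write needs acknowledgements from a majority, i.e. floor(N/2) + 1 replicas. A minimal sketch:

```python
def quorum(replicas: int) -> int:
    """Minimum acknowledgements for a quorum write: floor(N/2) + 1."""
    return replicas // 2 + 1

N = 5      # total followers
alive = 3  # two nodes have died

print(quorum(N))           # 3
print(alive >= quorum(N))  # True: the write still succeeds
```

With 5 replicas the quorum is 3, so losing two nodes keeps writes available; losing a third would not.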
When a node is down, the leader writes a hint to a file on its local disk. The hint contains the data along with the information about the node it belongs to. When the leader notices that a node for which it is holding hints has recovered, it forwards the write request for each hint to that node.
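The handoff logic above can be sketched as follows. This is an in-memory toy, not a real database API; the class and method names (`Leader`, `write`, `on_recover`) are made up for illustration, and hints are kept in a dict rather than a file on disk:

```python
from collections import defaultdict

class Leader:
    """Toy hinted-handoff sketch: store writes for dead nodes, replay on recovery."""

    def __init__(self, nodes):
        self.nodes = nodes              # node_id -> alive flag
        self.hints = defaultdict(list)  # node_id -> writes missed while down

    def write(self, node_id, record):
        if self.nodes[node_id]:
            return f"sent {record} to {node_id}"
        # Node is down: keep a hint instead of failing the write.
        self.hints[node_id].append(record)
        return f"hinted {record} for {node_id}"

    def on_recover(self, node_id):
        """Node came back online: replay every hint held for it."""
        self.nodes[node_id] = True
        return self.hints.pop(node_id, [])

leader = Leader({"n1": True, "n2": False})
leader.write("n2", "x=1")
leader.write("n2", "y=2")
print(leader.on_recover("n2"))  # ['x=1', 'y=2']
```

A production system (e.g. Cassandra) persists hints durably and bounds how long they are kept, so a node that stays down too long is repaired by other means instead.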
Thank you for reading my blog!
In the upcoming blog, I will be covering 6 more system design patterns. So, follow along if you want to get notified about it!