Consistent Hashing

SAKSHI CHHABRA
6 min read · Sep 24, 2023

Consistent hashing is a distributed hashing scheme that distributes cache keys efficiently and evenly across servers. In this blog, we are going to discuss the emergence and concepts of consistent hashing, how requests are redistributed when a server is added or removed, and virtual nodes, and we'll wrap up the blog with code.

Consistent hashing: the black-colored requests are sent to the next server on the ring

Assumptions:

We are working under the assumption that the servers are cache servers, which means they store user-specific information.

Cache:

Before we dive into consistent hashing, let's clear up what a cache is. A cache is a storage location for frequently accessed data. It improves performance by keeping data that is likely to be needed soon in a faster location. More on Distributed Caching here.

Requests, clients, and load are used as interchangeable terms in this post.

Distributed Hashing:

To avoid a single point of failure and to serve multiple requests simultaneously, real systems use more than one server. To distribute requests seamlessly across multiple servers, we place a load balancer in front of the servers that decides which request goes to which server.

Load balancer distributes requests among 4 servers

The easiest way to decide which request goes to which server is to use a hash function. For example, if there are n servers, the common way to balance the load is with the following hash function:

server = hash(request) % n

Let's see a few examples:

In this example, we are using 4 servers to serve requests. Atlanta is stored on server 2, Denver on server 1, Miami and Colorado on server 3, and San Jose on server 0.

Here, the hash function converts each key into a numeric value, and the modulo operation maps that number to a server.
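
A minimal sketch of this scheme is shown below; the city keys come from the example above, but the exact server each key lands on depends on the hash function used, so the output may not match the figure.

import hashlib

def server_for(key, n):
    # Hash the key to a number, then pick a server with modulo
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

for city in ["Atlanta", "Denver", "Miami", "Colorado", "San Jose"]:
    print(city, "-> server", server_for(city, 4))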

The Rehashing Problem:

Distributed hashing works well when the server count is fixed and requests are distributed evenly. However, problems arise when servers are added or removed, or when requests are not distributed evenly across servers.

For example, if a server becomes unavailable, the server count becomes 3 and the mapping changes to hash(request) % 3 -

Most of the hash % n values change, hence most requests are redirected to a different server. This means that most clients will be connecting to the wrong server, resulting in many cache misses. To avoid cache misses, we have to move requests to different servers based on the new hash function. This is the rehashing problem: updating the server count causes nearly all the requests to be remapped.
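
A rough sketch of the problem (the key names are made up): dropping from 4 servers to 3 remaps roughly three quarters of the keys.

import hashlib

def server_for(key, n):
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

keys = [f"user{i}" for i in range(1000)]
moved = sum(1 for k in keys if server_for(k, 4) != server_for(k, 3))
print(f"{moved} of {len(keys)} keys remapped")  # typically around 75%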

Consistent hashing handles this problem quite elegantly.

Consistent Hashing:

Consistent hashing is a hashing technique where resizing the server pool affects only about k/n keys on average, where k = total number of cached keys and n = number of cache servers.

Assume f is a hash function whose outputs are x0, x1, x2, …, xn. If we put all the output values on a hash ring, x0 would be at angle 0, xn would be at angle 360, and all the other values would fit linearly in between.

Using the same hash function f, we map the cache servers onto the ring -

In a similar way, we can map all the keys onto the ring -
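
As a rough sketch (the 0-360 range is only for the angle analogy; real implementations keep the full hash value), servers and keys can be placed on the ring like this:

import hashlib

def ring_angle(name):
    # Map a name to an angle on the ring; modulo 360 is illustrative only
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % 360

for name in ["server1", "server2", "server3", "server4", "key1", "key2", "key3"]:
    print(name, "-> angle", ring_angle(name))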

Server lookup:

To determine which server a key is stored on, we go clockwise from the key's location until a server is found. The key is stored on the first server we encounter while going clockwise.

key1 and key2 are mapped to server1, key3 is mapped to server2, key4 to server3, and key5 to server4
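
A minimal sketch of the clockwise lookup, assuming we already have the servers' sorted ring positions (the positions below are illustrative): bisect finds the first server at or after the key's position, and we wrap around to the first server if the key falls past the last one.

from bisect import bisect_left

def find_server(key_position, server_positions, servers):
    # server_positions must be sorted; servers[i] sits at server_positions[i]
    idx = bisect_left(server_positions, key_position)
    if idx == len(server_positions):
        idx = 0  # wrap around to the start of the ring
    return servers[idx]

positions = [40, 130, 220, 310]
names = ["server1", "server2", "server3", "server4"]
print(find_server(150, positions, names))  # -> server3
print(find_server(330, positions, names))  # wraps around -> server1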

Add a server:

With the above logic in place, adding a server only requires redistributing a fraction of the keys.

If server 5 is added, only the keys between server 4 and server 5 are remapped. In this example, only key1 is remapped from server1 to server5.

Remove a server:

Removing a server also requires redistributing only a fraction of the keys.

If server 3 is removed, all the keys mapped to server3 are remapped to the next server on the ring. In this example, key4 is remapped to server4.

Problems with the basic approach and how to solve them:

The above approach works perfectly under the assumption that partition sizes are uniform and keys are uniformly distributed across the servers. But in the real world -

  1. It is not possible to guarantee uniform partition sizes on the ring. A partition is the hash space between two adjacent servers, and the partition between two servers can end up very small or very large.
  2. It is possible to have a non-uniform distribution of keys on the ring. Nearly all the keys could be mapped to one server while the others get very few.

Both problems are solved by virtual nodes.

Virtual Nodes:

Virtual nodes, also referred to as virtual replicas, are an extension to consistent hashing that aims to distribute data more uniformly across all servers. Instead of mapping each physical node to a single point on the ring, each server is represented by multiple virtual nodes on the ring. This results in a more even distribution of keys across the nodes.

s0 = hash partitions that are mapped to server 0, s1 = hash partitions that are mapped to server 1

To understand virtual nodes, we arbitrarily choose 3 virtual replicas for each server; in the real world this count is much larger. Instead of placing s0 on the hash ring, we use s0_0, s0_1, and s0_2 to represent server 0 on the ring. With virtual nodes, each server is responsible for multiple partitions.
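
A small sketch of how one physical server expands into several ring positions (the suffix scheme and the 0-360 range are illustrative only):

import hashlib

def virtual_positions(server, replicas=3):
    # Each replica gets its own hash, and therefore its own spot on the ring
    return sorted(int(hashlib.sha256(f"{server}_{i}".encode()).hexdigest(), 16) % 360
                  for i in range(replicas))

print("s0 ->", virtual_positions("s0"))
print("s1 ->", virtual_positions("s1"))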

To find which server a key is stored on, we go clockwise from the key's position to the first virtual node on the ring.

As the number of virtual nodes increases, key distribution becomes more balanced on the ring. The outcome of an experiment carried out online shows that with 100 to 200 virtual nodes, the standard deviation of keys per server is between 5% and 10% of the mean, i.e., each server's load stays close to the average. Standard deviation measures how spread out the data is, and it becomes smaller as we increase the count of virtual nodes. However, more space is needed to store the virtual nodes, which is a tradeoff that needs to be tuned based on system requirements.

The code below will give a better understanding of consistent hashing using virtual nodes -

import hashlib

class ConsistentHashing:
    def __init__(self, nodes=None, replicas=3):
        self.replicas = replicas
        self.ring = {}          # ring position (hash) -> physical node
        self.sortedNodes = []   # ring positions, kept sorted

        for node in nodes or []:
            self.addNode(node)

    def addNode(self, node):
        # Each physical node is placed on the ring `replicas` times (virtual nodes)
        for i in range(self.replicas):
            key = self.hashKey(f"{node}_{i}")
            self.ring[key] = node
            self.sortedNodes.append(key)

        self.sortedNodes.sort()

    def removeNode(self, node):
        # Remove all of this node's virtual nodes from the ring
        for i in range(self.replicas):
            key = self.hashKey(f"{node}_{i}")
            del self.ring[key]
            self.sortedNodes.remove(key)

    def getNode(self, key):
        if not self.ring:
            return None

        hashedKey = self.hashKey(key)
        # Walk clockwise: the first virtual node at or after the key's hash owns it
        for nodeKey in self.sortedNodes:
            if hashedKey <= nodeKey:
                return self.ring[nodeKey]

        # If hashedKey is greater than every position on the ring, wrap around to the first
        return self.ring[self.sortedNodes[0]]

    def hashKey(self, key):
        # For this example, we use sha256 as the hash function
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    ch = ConsistentHashing(nodes)

    key = "testKey"
    selectedNode = ch.getNode(key)
    print("key is mapped to server -", selectedNode)

Benefits of consistent hashing:

  1. Only a minimal number of keys are redistributed when servers are added or removed.
  2. Easier to scale horizontally, since only about k/n keys are affected.
  3. Mitigates the hotspot problem: the chance of excessively accessed keys all landing on the same server is very low.

Popular Distributed Systems that use consistent hashing:

  1. Amazon DynamoDB
  2. Apache Cassandra
  3. Akamai Content Delivery Network

❤ Thank you for reaching this far! Please like, comment, and subscribe. If this blog gets 500+ likes, I'm going to post system design blogs weekly ❤

Website used for diagrams: https://app.diagrams.net/

