Consistent Hashing in Distributed Systems
How do you know where your data should be stored? And equally as important, what happens when a storage node fails or you add a new storage node to scale up your system? A hash is a common and efficient way to determine where your data should be stored in this scenario and it is especially important to understand if you are implementing any kind of data streaming application.
What is hashing?
Hashing uses ‘hash tables’ and ‘hash functions’ or algorithms to convert variable-sized data, such as a file or arbitrary string, into a consistent fixed-size output.
The hash function (or consistent hashing algorithm) is a mathematical operation that takes your data and outputs a number to represent that data, whose value is far less than the original size of the data it was provided. The number output by the hash function is called the ‘hash,’ and the value will be consistent every time the hash function is called with the same data. You can then lookup the data in the hash table
For a VERY simple example, consider the modulo operator (%) as a hash function. Regardless of the size of the number provided to the modulo operator, it will return a consistent number within a specified range n, for [data] mod n:
69855 MOD 10 = 5
63 MOD 10 = 3
916 MOD 10 = 6
26666 MOD 10 = 6
The last example illustrates a ‘collision,’ where different inputs produce the same hash value, 6. You can also see how you could change the output range by adjusting the hash function, e.g.,
x MOD 20 or
x MOD 30.
In real hash implementations, more complex hashing functions are used to handle collisions and make the most efficient use of memory, depending on the expected input data.
Why use hashing?
A common example of hashing is data storage and retrieval. Let’s say you had a collection of 15 billion items, and you need to retrieve a specific item; how would you do that? One mechanism way would be to iterate over your dataset, but this is inefficient as the number of keys in the key-value store grows; it is better to use a hash.
This would work as follows:
When storing the data items, first run a key associated with your data through a hash function. Based on the resulting hash, store your data in the hash table at the hashed location. In the event of a collision, you will need to take steps to ensure that no data is lost; for example, the hash table might store a linked list of data instead of the data itself (known as chained hashing), or your hash function might continue looking for an empty slot until one is found (techniques include probing or double hashing).
When retrieving data items, again run the key associated with your data through the hash function, then look up the resulting hash code in your hash table.
If I already have a key to my data, why don’t I just use that key directly to store my data? E.g. as an index to an array. You could (this is the equivalent of using a hash function that does nothing), but it is very inefficient. What if your keys are very large and non-contiguous?
Hashing in a distributed system
The previous example discussed using a hash table to store and retrieve a large list of data items, but what if that list is physically distributed across multiple servers?
You can use a hash function to determine where data should be stored in a distributed system. Instead of the hash function pointing to a location in a hash table, as in the previous example, a distributed system could use a different hash function whose output is a server ID, which can be mapped to an IP address.
For example: When storing data in a distributed system, first hash a key associated with your data to produce the server ID where your data should be stored, then store the data on that server. If you have chosen a hash function whose output is uniformly distributed across all available servers, none of your servers will be overloaded, and your data can be quickly retrieved. You can also consider load balancing along similar lines, where the number of servers increases to balance server load.
Let’s say you have a list of 5 servers that can hold your data, and you have assigned each of them a unique identity, from 1 to 5. In this example, you decide which server to store the data on based on a hash of that data's key:
Produce a hash code to represent the key of the data you are storing; the resulting hash code should fall in the range of 1 to 5 (since we have 5 servers)
The earlier example used a very simple modulo for the hash function; the equivalent here would be to think of our hash function being “
[key] MOD 5”.
Store the file on the server whose ID has been calculated.
This will work, but what happens when a server goes down? What happens when a new server is added? In both cases, all data needs to have its hash recalculated and moved based on the result of its new hash - known as rehashing, this is inefficient and why the above approach is too simplistic to be used in production.
What is consistent hashing?
Consistent hashing determines where data should be stored in a distributed system whilst tolerating servers being added or removed, minimizing the amount of data that needs to be moved or rehashed when infrastructure changes occur.
Difference between consistent hashing and hashing
Hashing is a general technique to take some variable amount of data and reduce it so it is stored in a consistent and fixed-sized storage structure known as a hash table for quick and efficient retrieval. The size of the hash table chosen will depend on the amount of data being stored and other more practical considerations, such as available system memory. When the hash table is resized, typically, the hash function will also change to ensure that data is evenly spread across the newly resized table, meaning all data must be rehashed and moved.
Consistent hashing is a special kind of hashing whereby when the hash table is resized, only a small portion of the data needs to be moved. Specifically, the number of data items that need to be moved is
n is the number of data items, and
m is the number of rows in the hash table (or servers, in our earlier distributed system example).
How does consistent hashing work?
With consistent hashing, imagine the servers (or any virtual node) holding your data arranged in a circle and numbered clockwise. To make things more intuitive, most illustrations will number the nodes from 1 to 360 (degrees), but the actual numbers chosen, and whether or not they are contiguous are unimportant; what matters is that the assigned server IDs ascend as you trace a route around the circle in a clockwise direction from one server to another. The ID of the next node will always be greater than the previous, until wrapped around.
When determining which server the data should be stored on, the hash of the data is calculated, and then the modulus of that hash is taken with 360; the resulting number will probably not coincide exactly with a valid server number, so the server ID which is closest to and greater than the resulting number is selected. If there is no server whose ID is greater than the resulting number, wrap-around back to 0 and continue looking for a server ID.
As an example:
Some data key exists, with a hash function applied to it, the outcome of which is 5562.
5562 MOD 360 is 162
We find the server in our deployment whose ID is greater than but closest to 162, e.g., id 175 in the diagram below.
What if I do not have 360 servers? Or more than 360 servers? The numbers are unimportant, only the principle. As long as the selected server IDs are numbered in ascending order, reasonably evenly spaced, and conceptually arranged in a ring (known as a hash ring).
Why use consistent hashing?
With our data being spread across multiple nodes (or servers), it becomes much more efficient to add or remove a node from our system. Let’s consider an example where we have 3 nodes, A, B, and C, with IDs 90, 180, and 270. Each node in our setup contains one-third of the data.
If we want to add a new node, D, we assign it a new ID, say 360.
We now have a choice whether to leave D empty and allow it to fill slowly as data is stored or to redistribute data from other nodes into D. Which choice we make depends on the solution being implemented; web caching and cache server implementations will do the former and allow cache misses to slowly populate D. Other solutions will take half of the data from A and transfer it to D (without getting too deeply into the mathematics the data transferred needs to have a hash whose value would result in it being stored in D if it was new data).
What is a consistent hashing based naming service?
The previous examples in this post have considered storing data or files on servers, but the principle can be extended to storing any data in a distributed manner. Naming services are amongst the largest and most distributed systems created and lend themselves well to being implemented by consistent hashing. Putting aside DNS for the moment, which has a specific and different implementation, we could implement a more generic naming service using consistent hashing as follows:
Naming records are stored across multiple servers.
Servers should be assigned an ID and arranged in ascending order in a conceptual ring.
To resolve a name, we must locate the server where the name record is stored.
Take a hash of the name and then take the modulus of the result with 360 (assuming we have fewer than 360 servers)
Find the server ID whose name is closest to but greater than the result found in the previous step. If no such server exists, wrap around to the first server in the ring with the lowest ID.
The name record will be found on the identified server.
You can also consider content delivery networks that work in a similar way, balancing requests across different nodes to avoid hot spots
Consistent hashing implementation
There are multiple existing implementations available on Github, ranging from Java to Python, where a number of nodes are arranged in a consistent hashing ring containing a distributed hash table.
Production storage systems exist that use consistent hashing, such as Redis, Akamai, or Amazon’s dynamo, many of which will use more complex data structures, such as memcache, binary search, or random trees for the final step of data lookup.
Steps to implement consistent hashing
Having explained the underlying technique, you can start implementing a consistent hashing solution, but you should consider the following:
How many servers will you have in your solution? Do you expect this number to grow?
How will you assign server IDs?
Which hashing function will you use to create your data hash? You will want to choose a function that produces uniform output so your data does not cluster on a single server and will be evenly distributed across all your servers.
Consider what happens when you add or remove a server from your solution. You should try and keep server IDs as evenly spaced as possible to avoid clustering your data.
Consistent hashing implementation example
Let’s consider an example where files are stored on one of 5 servers. We will assign server IDs between 1 and 360.
Servers are arranged in a conceptual ring and assigned IDs 74, 139, 220, 310 and 340
We need to store a new file, we hash this file's key, and the result is 1551
We perform 1551 MOD 360, which gives us 111
The data file will therefore be stored on the server with ID 139
To consider another example:
Servers are arranged in a conceptual ring and assigned IDs 74, 139, 220, 310 and 340
We need to store a new file, we hash this file's key, and the result is 1075
We perform 1075 MOD 360, which gives us 355
No server ID exceeds 355; therefore, the data file will be stored on the server with ID 74
Consistent hashing optimization
If a node (server) fails in a system using consistent hashing, by design, all the data will be reassigned to the next server in the ring. This means that a failing server will double the load on the next server, which, at best, will waste resources on the overloaded server and, at worst, might cause a cascade failure. A more even redistribution on server failure can be achieved by having each server hash to multiple locations or replicas on the ring. For example, a server has 4 possible hashes and so exists in 4 places in the ring, now when this server fails, the data stored on this server is redistributed to 4 different places, so the load on any other server does not exceed an additional quarter, which is more manageable.
Consistent hashing implementation with PubNub
PubNub is a developer API platform that enables applications to deliver messages on a massive, global scale. PubNub has multiple points of presence, allowing it to deliver messages securely, reliably, and quickly.
To deliver the scale and reliability our customers expect, we use several techniques on our backend, including consistent hashing.
If you are creating an application that needs to exchange data between clients or between a client and server at scale, then you probably have much more to think about than just which hashing mechanism to use. Let PubNub handle all your communication needs so you can focus on your business logic and not worry about the low-level data transmission of your app as it scales.