Maglev consistent hashing in Rust
Andreas Hohmann July 22, 2024 #maglev #hashing #rust #networkingIn a paper published in 2016, a group of Google engineers describes Google's network load load balancer called Maglev. This system uses a new (at the time) consistent hashing algorithm to select the target that a packet is sent to. This algorithm is now commonly known as "Maglev consistent hashing" and offered by several load balancers including Envoy. This post explores an implementation of this algorithm in Rust.
Before diving into the implementation, let's rehash (pun intended) the algorithm and the problem that it's trying to solve. Given an incoming packet, a load balaner has to select one of potentially many target servers to forward the packet to.
Using slightly more generic language, we have to send some input to a target. The input can be a packet (for a network load balancer) or an application layer request (for an application load balancer). The input has some characteristic fields that the load balancer uses for its decisions. In case of an IP packet, we have the 3-tuple consisting of source IP, destination IP, and transport protocol. In case of a TCP or UDP packet, we have the 5-tuple consisting of source IP, source port, destination IP, destination port, and application protocol.
These fields identify the service that should handle the input. The destination IP of an IP packet, for example, is often a virtual IP (VIP) of a particular service (typically obtained via DNS from the service's domain name), and the load balancer has to send the packet to a target that hosts this service.
For the discussion here we are going to assume that all targets can process all inputs. However, some target servers may be more powerful than others. We model this by giving each target a (small) non-negative integer weight. It's the load balancer's job to distribute the inputs fairly among the targets based on these weights.
Inputs with the same characteristic fields should be sent to the same target. In case of TCP, the packets will be part of a TCP connection (after the TCP handshake), but even in case of UDP and other connection-less IP protocols, this kind of target affinity helps to take advantage of caching and other techniques that work best if the target that a client talks to stays the same.
The load balancer itself may consist of multiple physical servers (instances) that the packets are routed to by network devices such as hardware routers and switches. Packets with the same characteristic tuple will not necessarily be processed by the same load balancer instance, but the outcome (which target is picked) should be the same. We would like to achieve this without complex synchronization between the load balancers. Ideally, each load balancer instance should be able to operate independently.
The set of targets may change as servers are added or removed, whether due to deliberate changes or unexpected events (crashes, hardware failures). These changes should have minimal impact on existing connections.
Taken together, the target selection algorithm should support the following requirements:
- distribute the inputs to the targets according to their weights
- send inputs with the same characteristic tuple to the same target
- pick the same target for the same input on all load balancer instances
- don't require synchronization between load balancer instances
- be efficient (in terms of memory and computation)
Of course, it's impossible to satisfy all these requirements perfectly. If a target is removed, we cannot sent packets to it anymore and therefore have to violate the second requirement, but we can at least try to minimize the impact of such changes on existing assignments. Similarly, different load balancer instances may have different information about the targets and may therefore compute different assignments.
The Maglev algorithm is an example of a "consistent hashing" algorithm. These algorithms use a hash of the characteristic tuple to select a target while trying to stay close to the requirements (that the "consistent" part). This kind of selection algorithm is typically combined with an in-memory table keeping track of existing "connections" (the conn-track table).
A popular consistent hashing algorithm is a hash ring. Logically, this algorithms distributes the targets on a circle divided into equal-sized slots. The number of slots $M$ is chosen to be significantly larger than the number of targets $N$, let's say, 10000 slots for 100 targets. Each target is mapped to a slot by computing a hash of the target modulo $M$. When selecting a target for an input, we hash the characteristic fields of the input, find the slot (hash modulo $M$), and then look for the next target on the ring in clockwise order. When a target is added or removed, only inputs that map to an affected segment of the ring will change their assigment. On the flipside, the distribution may become unbalanced. If a target is removed, it's load will be assigned to the next target on the ring whose load therefore essentially doubles.
The Maglev algorithm builds on the idea of the hash ring, but prioritizes the equal distribution of the load (the load-balancing aspect) over minimal disruption. The algorithm distributes the targets in a pseudo-random way onto all $M$ slots (not just $N$ out of the $M$ slots as in the ring hash). The hash value of the input modulo $M$ then points to a slot again, and the target in this slot is selected.
The pseudo-random distribution needs to be "stable" for the targets that stay the same when targets are added or removed. The Maglev algorithm achieves this by giving each target a "preference list" of slots. A preference list is a permutation of the slots numbers, and each target has its own such permutation. The slots are filled with the target numbers by letting the targets take turns to pick their next favorite slot that's not already taken. When a target is removed, some other target will take over the removed target's turns, but the slots of the other targets will stay mostly the same.
This leaves the question of how to define the pseudo-random permutations for each target. If the number of slots is a prime number $M$, each nonzero integer generates the integers from $0$ to $M-1$, that is,
for all $k\in{0,\ldots,M-1}$ and $l\in{1,\ldots,M-1}$.
A target can be identified by some name $t$. If we set $k$ and $l$ to the result (modulo $M$) of two different hash functions applied to this name
we can use the generated sequence
for $j=0,\ldots,M-1$ as a permutation of the slots $0,\ldots,M-1$. Multiplication modulo some prime number scrambles the numbers nicely (explaining its use in encryption) so that these permutations look pseudo-random. This gives us pseudo-random permutations that can be computed efficiently from the target without any synchronization between the load balancer instances (beyond the target set).
Let's look at an example with $M=11$ slots and $N=3$ targets.
When we let the targets take turns and fill the slots with their preferences, target 0 gets slot 5, target 1 gets slot 9, and target 2 gets slot 3. Then it's target 0's turn again with its next preference, slot 2. Target 1's next preference is slot 3, but that's already taken and so is the next one, slot 9, so it ends up with slot 1. In the end, we arrive at the following distribution of the 3 targets on the 11 slots:
The algorithm ensures that each target takes $\lfloor M/N\rfloor$ or $\lceil M/N \rceil$ slots, in this case 4 slots for target 0 and 1, and 3 slots for target 2.
Let's translate this to Rust. The following function populates the slots with the indexes of the targets based on their preference lists. The actual algorithm takes only 11 lines.
/// Populates the target slots according to the Maglev algorithm.
///
/// `preference(i, j)` is the `j`-th element of the preference list of target `i` where `i` is
/// a target index (0 <= i < target_count) and 'j' a slot index (0 <= j < target_slots.len()).
The index m
is the current slot, and i
is the current target. I was
initially tempted to use Rust iterators to model the preference lists, but a
single function (passed as a Fn
trait object so that we can use closures)
turned out to be much simpler. As in the paper, we keep the current indexes of
the permutations in a next
vector.
We use usize
for all indexes because Rust's index operator requires this
type. A smaller unsigned integer type would be slightly more memory-efficient
(at least on 64 bit architectures), but the required casts would clutter the
code. The unsigned index type is also the reason for using the target_count
as the value indicating that the slot is not set yet (the paper uses -1
instead).
Let's test the function using the example shown earlier.
This first version does not support weights yet. The targets take turns using
the simple formula i = m % target_count
. This does not work anymore when we
include weights and let the targets take turns according to their weights. The
for-loop filling the slots becomes a while loop, and targets with weight zero
have to be skipped. If all weights are zero, we cannot fill the slots at all.
/// Populates the target slots according to the Maglev algorithm.
///
/// `preference(i, j)` is the `j`-th element of the preference list of target `i` where `i` is
/// a target index (0 <= i < target_count) and 'j' a slot index (0 <= j < target_slots.len()).
///
/// Returns true if the slots were filled and false if all weights are zero so that the slots
/// cannot be filled.
Each target gets as many turns in a row as its weight. We set k
to the weight
of the current target i
and count down. When we reach zero, we move on to the
next target (cyclically). If all weights are 1, the function is equivalent to
the previous version without weights. An early exit takes care of the special
case of no positive weight (no target available) so that we don't need another
check in the main loop.
Let's apply this to our example with different weights to confirm that the algorithms assigns the slots according to the weights.
The results demonstrate indeed the behavior we were looking for. Removing target 1 reassigns its slots to the other targets while the other slots stay the same with the exception of one flip from 0 to 2. Similarly, doubling the weight of target 1 gives it more slots while keeping the other slots of target 0 and 2.
Now that we have implemented the core algorithm, it's time to think about the API. A hash-based target selector computes a target index from an input hash. We have to take into account that the assignment may fail if there is no target with positive weight. Let's create a custom error type for that:
use Error;
use ;
With this error type, we can define the trait for the target selector.
/// Selection of targets based on a hash of the input.
Which data does the Maglev implementation of this trait need? We definitely need the target slots. It will also be useful to know the weights. They give us the target count and determine if we can find a target at all.
In the remaining code snippets we will omit the documentation because the functions are either self-explanatory or described in the text.
The target_counts
function returns how many slots each target has. That's going to
be handy in the forthcoming tests. Let's first rewrite the initial example in terms
of the new structure.
We have not covered the actual hash functions yet. Fortunately, Rust's hashing API makes this a breeze. For simplicity, we are going to use the default hasher and construct different hash functions by adding an initial seed.
use ;
We need three hash functions, two for the start and stride of the targets to
generate the preference lists and another for the input hash. A set of hash
functions for different purposes is common enough to deserve its own structure.
We define a new HashFactory
trait that gives us a Hasher
for some hash
type. The default implementation keeps a seed for each hash type.
use HashMap;
The hash
convenience function applies the hash function of given hash type to
an arbitrary value (which has to implement the Hash
trait). The
hash_with_limit
function takes this hash modulo some limit.
The three hash types for the Maglev algorithm are Start
, Stride
, and Input
. We
initialize the associated seeds with some arbitrary numbers.
type MaglevHasherFactory = ;
This allows us to create the preference function for any list of target as long as these target objects are hashable.
Let's add one more test using this whole setup with a list of targets given as strings.
Rust allows us to implement these structures and functions in a generic way without much ceremony. We could apply them without change to more complex target and input objects such as the characteric tuples of packets.