Maglev consistent hashing in Rust

Andreas Hohmann July 22, 2024 #maglev #hashing #rust #networking

In a paper published in 2016, a group of Google engineers describes Google's network load load balancer called Maglev. This system uses a new (at the time) consistent hashing algorithm to select the target that a packet is sent to. This algorithm is now commonly known as "Maglev consistent hashing" and offered by several load balancers including Envoy. This post explores an implementation of this algorithm in Rust.

Before diving into the implementation, let's rehash (pun intended) the algorithm and the problem that it's trying to solve. Given an incoming packet, a load balaner has to select one of potentially many target servers to forward the packet to.

Using slightly more generic language, we have to send some input to a target. The input can be a packet (for a network load balancer) or an application layer request (for an application load balancer). The input has some characteristic fields that the load balancer uses for its decisions. In case of an IP packet, we have the 3-tuple consisting of source IP, destination IP, and transport protocol. In case of a TCP or UDP packet, we have the 5-tuple consisting of source IP, source port, destination IP, destination port, and application protocol.

These fields identify the service that should handle the input. The destination IP of an IP packet, for example, is often a virtual IP (VIP) of a particular service (typically obtained via DNS from the service's domain name), and the load balancer has to send the packet to a target that hosts this service.

For the discussion here we are going to assume that all targets can process all inputs. However, some target servers may be more powerful than others. We model this by giving each target a (small) non-negative integer weight. It's the load balancer's job to distribute the inputs fairly among the targets based on these weights.

Inputs with the same characteristic fields should be sent to the same target. In case of TCP, the packets will be part of a TCP connection (after the TCP handshake), but even in case of UDP and other connection-less IP protocols, this kind of target affinity helps to take advantage of caching and other techniques that work best if the target that a client talks to stays the same.

The load balancer itself may consist of multiple physical servers (instances) that the packets are routed to by network devices such as hardware routers and switches. Packets with the same characteristic tuple will not necessarily be processed by the same load balancer instance, but the outcome (which target is picked) should be the same. We would like to achieve this without complex synchronization between the load balancers. Ideally, each load balancer instance should be able to operate independently.

The set of targets may change as servers are added or removed, whether due to deliberate changes or unexpected events (crashes, hardware failures). These changes should have minimal impact on existing connections.

Taken together, the target selection algorithm should support the following requirements:

  1. distribute the inputs to the targets according to their weights
  2. send inputs with the same characteristic tuple to the same target
  3. pick the same target for the same input on all load balancer instances
  4. don't require synchronization between load balancer instances
  5. be efficient (in terms of memory and computation)

Of course, it's impossible to satisfy all these requirements perfectly. If a target is removed, we cannot sent packets to it anymore and therefore have to violate the second requirement, but we can at least try to minimize the impact of such changes on existing assignments. Similarly, different load balancer instances may have different information about the targets and may therefore compute different assignments.

The Maglev algorithm is an example of a "consistent hashing" algorithm. These algorithms use a hash of the characteristic tuple to select a target while trying to stay close to the requirements (that the "consistent" part). This kind of selection algorithm is typically combined with an in-memory table keeping track of existing "connections" (the conn-track table).

A popular consistent hashing algorithm is a hash ring. Logically, this algorithms distributes the targets on a circle divided into equal-sized slots. The number of slots $M$ is chosen to be significantly larger than the number of targets $N$, let's say, 10000 slots for 100 targets. Each target is mapped to a slot by computing a hash of the target modulo $M$. When selecting a target for an input, we hash the characteristic fields of the input, find the slot (hash modulo $M$), and then look for the next target on the ring in clockwise order. When a target is added or removed, only inputs that map to an affected segment of the ring will change their assigment. On the flipside, the distribution may become unbalanced. If a target is removed, it's load will be assigned to the next target on the ring whose load therefore essentially doubles.

The Maglev algorithm builds on the idea of the hash ring, but prioritizes the equal distribution of the load (the load-balancing aspect) over minimal disruption. The algorithm distributes the targets in a pseudo-random way onto all $M$ slots (not just $N$ out of the $M$ slots as in the ring hash). The hash value of the input modulo $M$ then points to a slot again, and the target in this slot is selected.

The pseudo-random distribution needs to be "stable" for the targets that stay the same when targets are added or removed. The Maglev algorithm achieves this by giving each target a "preference list" of slots. A preference list is a permutation of the slots numbers, and each target has its own such permutation. The slots are filled with the target numbers by letting the targets take turns to pick their next favorite slot that's not already taken. When a target is removed, some other target will take over the removed target's turns, but the slots of the other targets will stay mostly the same.

This leaves the question of how to define the pseudo-random permutations for each target. If the number of slots is a prime number $M$, each nonzero integer generates the integers from $0$ to $M-1$, that is,

for all $k\in{0,\ldots,M-1}$ and $l\in{1,\ldots,M-1}$.

A target can be identified by some name $t$. If we set $k$ and $l$ to the result (modulo $M$) of two different hash functions applied to this name

we can use the generated sequence

for $j=0,\ldots,M-1$ as a permutation of the slots $0,\ldots,M-1$. Multiplication modulo some prime number scrambles the numbers nicely (explaining its use in encryption) so that these permutations look pseudo-random. This gives us pseudo-random permutations that can be computed efficiently from the target without any synchronization between the load balancer instances (beyond the target set).

Let's look at an example with $M=11$ slots and $N=3$ targets.

When we let the targets take turns and fill the slots with their preferences, target 0 gets slot 5, target 1 gets slot 9, and target 2 gets slot 3. Then it's target 0's turn again with its next preference, slot 2. Target 1's next preference is slot 3, but that's already taken and so is the next one, slot 9, so it ends up with slot 1. In the end, we arrive at the following distribution of the 3 targets on the 11 slots:

The algorithm ensures that each target takes $\lfloor M/N\rfloor$ or $\lceil M/N \rceil$ slots, in this case 4 slots for target 0 and 1, and 3 slots for target 2.

Let's translate this to Rust. The following function populates the slots with the indexes of the targets based on their preference lists. The actual algorithm takes only 11 lines.

/// Populates the target slots according to the Maglev algorithm.
///
/// `preference(i, j)` is the `j`-th element of the preference list of target `i` where `i` is
/// a target index (0 <= i < target_count) and 'j' a slot index (0 <= j < target_slots.len()).
fn populate_maglev_slots_without_weights(
    target_count: usize,
    preference: &impl Fn(usize, usize) -> usize,
    target_slots: &mut [usize],
) {
    // Use target_count as sentinel value for "not set".
    target_slots.fill(target_count);

    // `next[i]` is the index of the next permutation value for target `i`.
    let mut next: Vec<usize> = vec![0; target_count];

    // Fill all slots.
    let slot_count = target_slots.len();
    for m in 0..slot_count {
        // Targets take turns.
        let i = m % target_count;
        // Look for target i's next slot that is not taken yet.
        let (j, p) = (next[i]..slot_count)
            .map(|j| (j, preference(i, j)))
            .find(|(_, p)| target_slots[*p] == target_count)
            // Value must exist because each permutation contains all slots.
            .unwrap();
        next[i] = j + 1;
        target_slots[p] = i;
    }
}

The index m is the current slot, and i is the current target. I was initially tempted to use Rust iterators to model the preference lists, but a single function (passed as a Fn trait object so that we can use closures) turned out to be much simpler. As in the paper, we keep the current indexes of the permutations in a next vector.

We use usize for all indexes because Rust's index operator requires this type. A smaller unsigned integer type would be slightly more memory-efficient (at least on 64 bit architectures), but the required casts would clutter the code. The unsigned index type is also the reason for using the target_count as the value indicating that the slot is not set yet (the paper uses -1 instead).

Let's test the function using the example shown earlier.

#[test]
fn test_populate_maglev_slots_without_weights() {
    // Prime slot count guarantees that all positive strides generate all slots.
    let slot_count = 11;
    let target_count = 3;

    let starts = [5, 9, 3];
    let strides = [2, 3, 5];

    let preference =
        |i: usize, j: usize| -> usize { (starts[i] + j * strides[i]) % slot_count };

    let mut target_slots = vec![0; slot_count];
    populate_maglev_slots_without_weights(target_count, &preference, &mut target_slots);

    assert_eq!(target_slots, vec![0, 1, 2, 2, 1, 0, 0, 0, 2, 1, 1]);
}

This first version does not support weights yet. The targets take turns using the simple formula i = m % target_count. This does not work anymore when we include weights and let the targets take turns according to their weights. The for-loop filling the slots becomes a while loop, and targets with weight zero have to be skipped. If all weights are zero, we cannot fill the slots at all.

fn has_positive_value(values: &[usize]) -> bool {
    values.iter().any(|weight| *weight > 0)
}

/// Populates the target slots according to the Maglev algorithm.
///
/// `preference(i, j)` is the `j`-th element of the preference list of target `i` where `i` is
/// a target index (0 <= i < target_count) and 'j' a slot index (0 <= j < target_slots.len()).
///
/// Returns true if the slots were filled and false if all weights are zero so that the slots
/// cannot be filled.
fn populate_maglev_slots(
    preference: &impl Fn(usize, usize) -> usize,
    target_weights: &[usize],
    target_slots: &mut [usize],
) -> bool {
    if !has_positive_value(target_weights) {
        return false;
    }
    let target_count = target_weights.len();
    let slot_count = target_slots.len();

    // Use target_count as sentinel value for "not set".
    target_slots.fill(target_count);

    // `next[i]` is the index of the next permutation value for target `i`.
    let mut next: Vec<usize> = vec![0; target_count];

    // Fill all slots.
    let mut i = 0;
    let mut k = target_weights[i];
    let mut m = 0;
    while m < slot_count {
        if k > 0 {
            // Look for target i's next slot that is not taken yet.
            let (j, p) = (next[i]..slot_count)
                .map(|j| (j, preference(i, j)))
                .find(|(_, p)| target_slots[*p] == target_count)
                // Value must exist because each permutation contains all slots.
                .unwrap();
            next[i] = j + 1;
            target_slots[p] = i;
            m += 1;

            // Let targets take turns according to their weights.
            k -= 1;
        }
        if k == 0 {
            i = (i + 1) % target_count;
            k = target_weights[i];
        }
    }
    true
}

Each target gets as many turns in a row as its weight. We set k to the weight of the current target i and count down. When we reach zero, we move on to the next target (cyclically). If all weights are 1, the function is equivalent to the previous version without weights. An early exit takes care of the special case of no positive weight (no target available) so that we don't need another check in the main loop.

Let's apply this to our example with different weights to confirm that the algorithms assigns the slots according to the weights.

#[test]
fn test_populate_maglev_slots_with_weights() {
    // Prime slot count guarantees that all positive strides generate all slots.
    let slot_count = 11;
    let target_count = 3;

    let starts = [5, 9, 3];
    let strides = [2, 3, 5];

    let preference =
        |i: usize, j: usize| -> usize { (starts[i] + j * strides[i]) % slot_count };

    let mut slots = vec![0; slot_count];
    {
        let weights = [1, 1, 1];
        assert!(populate_maglev_slots(&preference, &weights, &mut slots));
        assert_eq!(slots, vec![0, 1, 2, 2, 1, 0, 0, 0, 2, 1, 1]);
    }
    {
        let weights = [1, 0, 1];
        assert!(populate_maglev_slots(&preference, &weights, &mut slots));
        assert_eq!(slots, vec![0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 0]);
    }
    {
        let weights = [1, 2, 1];
        assert!(populate_maglev_slots(&preference, &weights, &mut slots));
        assert_eq!(slots, vec![0, 1, 1, 2, 1, 0, 1, 0, 2, 1, 1]);
    }
    {
        let weights = [0, 0, 0];
        assert!(!populate_maglev_slots(&preference, &weights, &mut slots));
    }
}

The results demonstrate indeed the behavior we were looking for. Removing target 1 reassigns its slots to the other targets while the other slots stay the same with the exception of one flip from 0 to 2. Similarly, doubling the weight of target 1 gives it more slots while keeping the other slots of target 0 and 2.

Now that we have implemented the core algorithm, it's time to think about the API. A hash-based target selector computes a target index from an input hash. We have to take into account that the assignment may fail if there is no target with positive weight. Let's create a custom error type for that:

use std::error::Error;
use std::fmt::{Debug, Display, Formatter};

#[derive(Debug)]
enum TargetSelectionError {
    NoTargetsAvailable,
}

impl Display for TargetSelectionError {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(f, "{:?}", self)
    }
}

impl Error for TargetSelectionError {}

With this error type, we can define the trait for the target selector.

/// Selection of targets based on a hash of the input.
trait HashTargetSelector {
    /// Selects a target for an input.
    ///
    /// The input is provided as a hash value, and the target is identified by its zero-based
    /// index
    ///
    /// In case of a load balancer, the input is a packet or request. The hash function is
    /// applied to some characteristic fields of the input (for example the 5-tuple of a TCP or
    /// UDP packet or some HTTP header fields.
    fn select_target(&self, input_hash: usize) -> Result<usize, TargetSelectionError>;
}

Which data does the Maglev implementation of this trait need? We definitely need the target slots. It will also be useful to know the weights. They give us the target count and determine if we can find a target at all.

In the remaining code snippets we will omit the documentation because the functions are either self-explanatory or described in the text.

struct MaglevTargetSelector {
    target_weights: Vec<usize>,
    target_slots: Vec<usize>,
}

impl MaglevTargetSelector {
    fn new(slot_count: usize) -> Self {
        MaglevTargetSelector {
            target_weights: vec![],
            target_slots: vec![0; slot_count],
        }
    }

    fn target_count(&self) -> usize {
        self.target_weights.len()
    }

    fn target_counts(&self) -> Vec<usize> {
        compute_counts(&self.target_slots, self.target_count(), |i| *i)
    }

    fn slot_count(&self) -> usize {
        self.target_slots.len()
    }

    fn targets_available(&self) -> bool {
        has_positive_value(&self.target_weights)
    }

    fn populate(
        &mut self,
        preference: &impl Fn(usize, usize) -> usize,
        weights: &[usize],
    ) -> bool {
        self.target_weights = Vec::from(weights);
        populate_maglev_slots(preference, weights, &mut self.target_slots)
    }
}

impl HashTargetSelector for MaglevTargetSelector {
    fn select_target(&self, input_hash: usize) -> Result<usize, TargetSelectionError> {
        if self.targets_available() {
            Ok(self.target_slots[input_hash % self.slot_count()])
        } else {
            Err(TargetSelectionError::NoTargetsAvailable)
        }
    }
}

fn compute_counts<T>(xs: &[T], size: usize, f: impl Fn(&T) -> usize) -> Vec<usize> {
    let mut counts = vec![0; size];
    for x in xs {
        counts[f(x)] += 1;
    }
    counts
}

The target_counts function returns how many slots each target has. That's going to be handy in the forthcoming tests. Let's first rewrite the initial example in terms of the new structure.

#[test]
fn maglev_selector_populates_with_weights() {
    // Prime slot count guarantees that all positive strides generate all slots.
    let slot_count = 11;

    let starts = [5, 9, 3];
    let strides = [2, 3, 5];

    let preference =
        |i: usize, j: usize| -> usize { (starts[i] + j * strides[i]) % slot_count };

    let mut selector = MaglevTargetSelector::new(slot_count);
    {
        let weights = [1, 2, 1];
        assert!(selector.populate(&preference, &weights));

        assert_eq!(selector.target_slots, vec![0, 1, 1, 2, 1, 0, 1, 0, 2, 1, 1]);
        assert_eq!(selector.select_target(0), Ok(0));
        assert_eq!(selector.select_target(4), Ok(1));
        assert_eq!(selector.select_target(99), Ok(0));
    }
    {
        let weights = vec![0; 3];
        assert!(!selector.populate(&preference, &weights));

        assert_eq!(
            selector.select_target(0),
            Err(TargetSelectionError::NoTargetsAvailable)
        );
    }
}

We have not covered the actual hash functions yet. Fortunately, Rust's hashing API makes this a breeze. For simplicity, we are going to use the default hasher and construct different hash functions by adding an initial seed.

use std::hash::{DefaultHasher, Hash, Hasher};

fn create_hasher(seed: i64) -> impl Hasher {
    let mut hasher: DefaultHasher = DefaultHasher::new();
    hasher.write_i64(seed);
    hasher
}

We need three hash functions, two for the start and stride of the targets to generate the preference lists and another for the input hash. A set of hash functions for different purposes is common enough to deserve its own structure. We define a new HashFactory trait that gives us a Hasher for some hash type. The default implementation keeps a seed for each hash type.

trait HasherFactory<T> {
    fn create_hasher(&self, hash_type: &T) -> impl Hasher;

    fn hash<V: Hash>(&self, hash_type: &T, value: V) -> usize {
        let mut hasher = self.create_hasher(hash_type);
        value.hash(&mut hasher);
        hasher.finish() as usize
    }

    fn hash_with_limit<V: Hash>(&self, hash_type: &T, value: V, limit: usize) -> usize {
        self.hash(hash_type, value) % limit
    }
}

use std::collections::HashMap;

struct DefaultHasherFactory<T: Eq + Hash> {
    seeds: HashMap<T, i64>,
}

impl<T: Eq + Hash> HasherFactory<T> for DefaultHasherFactory<T> {
    fn create_hasher(&self, hash_type: &T) -> impl Hasher {
        create_hasher(self.seeds[hash_type])
    }
}

The hash convenience function applies the hash function of given hash type to an arbitrary value (which has to implement the Hash trait). The hash_with_limit function takes this hash modulo some limit.

The three hash types for the Maglev algorithm are Start, Stride, and Input. We initialize the associated seeds with some arbitrary numbers.

#[derive(PartialEq, Eq, Hash, Copy, Clone)]
enum MaglevHashType {
    Start,
    Stride,
    Input,
}

type MaglevHasherFactory = DefaultHasherFactory<MaglevHashType>;

fn create_maglev_hasher_factory() -> MaglevHasherFactory {
    let seeds: HashMap<MaglevHashType, i64> = HashMap::from([
        (MaglevHashType::Start, 295801983),
        (MaglevHashType::Stride, 918564629583),
        (MaglevHashType::Input, 3857471928),
    ]);
    MaglevHasherFactory { seeds }
}

This allows us to create the preference function for any list of target as long as these target objects are hashable.

fn create_maglev_hash_preference<T: Hash>(
    hasher_factory: &MaglevHasherFactory,
    targets: &[T],
    slot_count: usize,
) -> impl Fn(usize, usize) -> usize {
    let starts: Vec<usize> = targets
        .iter()
        .map(|t| hasher_factory.hash_with_limit(&MeglevHashType::Start, t, slot_count))
        .collect();
    let strides: Vec<usize> = targets
        .iter()
        .map(|t| hasher_factory.hash_with_limit(&MeglevHashType::Stride, t, slot_count - 1) + 1)
        .collect();
    move |i: usize, j: usize| -> usize { (starts[i] + j * strides[i]) % slot_count }
}

Let's add one more test using this whole setup with a list of targets given as strings.

#[test]
fn maglev_assigner_populate_with_hash() {
    let hasher_factory = create_maglev_hasher_factory();
    let slot_count = 65537;

    let targets = vec!["target-1", "target-2", "target-3", "target-4"];
    let target_count = targets.len();
    let weights = vec![1; target_count];

    let preference = create_maglev_hash_preference(&hasher_factory, &targets, slot_count);
    let mut selector = MaglevTargetSelector::new(slot_count);

    assert!(selector.populate(&preference, &weights));
    assert_eq!(selector.target_counts(), vec![16385, 16384, 16384, 16384]);

    assert_eq!(
        selector.select_target(hasher_factory.hash(&MeglevHashType::Input, "some-input")),
        Ok(1)
    );
    assert_eq!(
        selector.select_target(hasher_factory.hash(&MeglevHashType::Input, "another-input")),
        Ok(3)
    );
}

Rust allows us to implement these structures and functions in a generic way without much ceremony. We could apply them without change to more complex target and input objects such as the characteric tuples of packets.