Zig structs of arrays

Andreas Hohmann May 04, 2024 #software #zig #data oriented programming #comptime

If you want to see the power of Zig's comptime, look no further than MultiArrayList. This collection stores the data of a list of structs in struct of array (SoA) rather than an array of structs (AoS). This technique is a staple of data-oriented design and array-oriented programming (long live APL) as used in high-performance applications such as game engines, scientific computing, and compilers. The latter is probably the reason why it already exists in Zig's standard library (see Andrew Kelley's Practical Data-Oriented Design presentation).

Generating a struct-of-arrays type for a struct is a non-trivial type manipulation that I expect from type-oriented languages such as TypeScript or Scala, but not from a low-level system programming language such as Zig. How does Zig pull this off? It turns out that it's "just" compile-time execution and reflection.

Types in Zig are compile time values. They can be assigned to constants, passed as arguments to (compile-time) functions, and returned by those functions. This is already visible in the syntax for type definitions. A named struct type, for example, is defined by initializing a constant with an anonymous struct declaration.

1const Token = struct {
2 kind: enum { id, string, number },
3 data: []const u8,
4};
5
6test "token size" {
7 try testing.expectEqual(24, @sizeOf(Token));
8 try testing.expectEqual(2400, @sizeOf([100]Token));
9}

As the test shows, this structure needs 24 bytes on my 64 bit machine, 2*8 bytes for the data slice (pointer and length) and 8 bytes for the kind enum and alignment. An array of 100 Token structs needs 2400 bytes.

The MultiArrayList cuts the memory needed down to 1700 bytes: 1600 bytes for the data slices and 100 bytes for the kinds. The test below demonstrates this by using a fixed buffer allocator with exactly 1704 bytes (I don't know why the 4 extra bytes are needed).

1const TokenList = std.MultiArrayList(Token);
2
3test "token lists" {
4 var buffer: [1704]u8 = undefined;
5 var fba = std.heap.FixedBufferAllocator.init(&buffer);
6 const allocator = fba.allocator();
7
8 var tokens = TokenList{};
9 try tokens.setCapacity(allocator, 100);
10 defer tokens.deinit(allocator);
11
12 tokens.appendAssumeCapacity(.{ .kind = .number, .data = "1000" });
13 try testing.expectEqual(Token{ .kind = .number, .data = "1000" }, tokens.get(0));
14}

How does this work under the hood? Types are passed to a function using a compile-time (comptime) parameter of type type. This is the foundation for Zig's generic types such as the collections in the standard library. Here is a minimal version of an array list of fixed size (see array_list.zig for the real thing).

1pub fn FixedArrayList(comptime T: type) type {
2 return struct {
3 const Self = @This();
4
5 items: []T,
6 allocator: Allocator,
7
8 pub fn init(allocator: Allocator, length: usize) Allocator.Error!Self {
9 return .{
10 .allocator = allocator,
11 .items = try allocator.alloc(T, length),
12 };
13 }
14
15 pub fn deinit(self: *Self) void {
16 self.allocator.free(self.items.ptr[0..self.items.len]);
17 }
18 };
19}
20
21test "allocates array list" {
22 const allocator = testing.allocator;
23
24 const n = 10;
25 const PointList = FixedArrayList(Point);
26 var points = try PointList.init(allocator, n);
27 defer points.deinit();
28
29 points.items[5] = .{ .x = 10, .y = 20 };
30 try testing.expectEqual(20, points.get(5).y);
31}

This is not a very useful structure as it only wraps a slice that one could use in most cases directly, but it demonstrates a simple type generator function. The type argument T is only used once for the type of the items slice, and we don't need to know anything about the inner structure of T.

To construct the struct-of-arrays for a given struct, we have to go a step further. Our type construction function must be able to look at the original struct's fields and their types to construct the new struct-of-arrays type. Zig offers a compile-time reflection API for this purpose.

To get a feel for this API, let's define a type constructor for points with explicit coordinates x1, x2, x3, and so forth. Our type function takes the type T of the coordinates and the dimension n and returns a struct with one field for each of the dimensions.

1pub fn PointN(comptime T: type, comptime N: comptime_int) type {
2 var fields: [N]std.builtin.Type.StructField = undefined;
3 for (0..N) |i| {
4 var num_buf: [128]u8 = undefined;
5 fields[i] = .{
6 .name = std.fmt.bufPrintZ(&num_buf, "x{d}", .{i + 1}) catch unreachable,
7 .type = T,
8 .default_value = null,
9 .is_comptime = false,
10 .alignment = @alignOf(T),
11 };
12 }
13 return @Type(.{
14 .Struct = .{
15 .is_tuple = false,
16 .layout = .Auto,
17 .decls = &.{},
18 .fields = &fields,
19 },
20 });
21}
22
23test "n-dimensional point" {
24 const P3 = PointN(i32, 3);
25 const p = P3{ .x1 = 10, .x2 = 20, .x3 = 30 };
26 try expectEqual(p.x2, 20);
27}

We construct the struct type dynamically using the @Type function. The metadata of the fields is an array of StructField objects. We are still not using the inner structure of the type T (besides its alignment obtained with @alignOf), but we can see what kind of metadata Zig is using to describe structs and fields. The final missing piece is a way to obtain this metadata for a given type. Zig's @typeInfo function does just that.

The MultiArrayList stores the data in a single byte array. This array is the concatenation of the arrays for the fields of the original structure. To optimize alignment, these sub-arrays are included in decending order of their alignment (the alignment of the corresponding field). The generated struct-of-arrays type keeps the field sizes and field index permutation (due to the sorting) in the sizes structure. This structure is used to compute the overall allocation size (capacityInBytes) and the offsets for the field slices.

The rest of the MultiArrayList code implements the various methods for adding and removing items, resizing, and conversion to a slice of structs. Considering the low-level pointer and index gymnastics, it is fairly readable.

Zig's compile-time type construction has its limits. It's not possible, for example, to generate methods dynamically, in contrast to the field construction that we used for the PointN structure. However, this is a current restriction of the reflection API and not a fundamental issue of the type generation functions, and this restriction may be lifted in the future. The main benefit of the comptime function approach is that we don't have to learn a new language (such as Rust's procedural macros), even though we do have to learn the reflection API.