Infrastructure as data

Andreas Hohmann March 30, 2024 #infrastructure #data model #configuration

We say "infrastructure as code", but I think what we really mean is "infrastructure as data". Let me explain.

What comes to mind when we think of "infrastructure as code"? What are the benefits we want to gain from the "as code" part? I first think of the way we handle code using version control and continuous integration. The version control system gives us global atomic changes, history, test and approval workflows, and the ability to trigger downstream processes based on the changes. We can easily compare the state of an object (a file) between different versions and answer when, by whom, and even why (if the commit messages are good) a change was made.

What we don't think about as much is the code itself: semi-structured pieces of text that need some of the most sophisticated software on this planet (compilers) to make sense of them. Code is written in a multitude of different and evolving languages, many of which require years to master. These programming languages let us express the same thing in dozens of ways while lacking basic means to model strict relationships between objects.

Unfortunately, "infrastructure as code" gives us both: the advantages of version control, automated workflows, and continuous integration, as well as the disadvantages of (many) text-based languages for capturing data.

The flexibility of programming languages is useful and needed when writing software, that is, when defining new abstractions and applications. It's not as good a fit when modeling real objects in a fixed domain such as infrastructure.

No matter the configuration language, there will be a lot of repetition. Most configuration languages are hierarchical and denormalized. This opens the door for local modifications, which eventually prevent us from normalizing the data even if we wanted to (who remembers why this field was changed just in this one case?). To work around the problems of hierarchical data representations, some configuration languages allow for references, mutable data, imperative statements, and functions, sometimes turning the configuration languages into Turing-complete programming languages or embedding them in "real" programming languages. The resulting modest improvement in reusability is largely offset by the increase in complexity.
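To make the repetition and the "local modification" problem concrete, here is a small Python sketch. The host names, roles, and settings are hypothetical, and the `normalize` function is just one possible way to factor shared settings out of a denormalized configuration. It is an illustration, not a recommended tool.

```python
from collections import Counter

# Hypothetical denormalized config: every host repeats its role's settings.
HOSTS = {
    "web1": {"role": "web", "os": "debian-12", "port": 443},
    "web2": {"role": "web", "os": "debian-12", "port": 443},
    "web3": {"role": "web", "os": "debian-12", "port": 8443},  # local modification
}


def normalize(hosts):
    """Split host settings into per-role defaults and per-host overrides."""
    by_role = {}
    for name, cfg in hosts.items():
        by_role.setdefault(cfg["role"], {})[name] = cfg

    roles, overrides = {}, {}
    for role, members in by_role.items():
        keys = {k for cfg in members.values() for k in cfg if k != "role"}
        # The most common value of each setting becomes the role default.
        defaults = {
            key: Counter(cfg.get(key) for cfg in members.values()).most_common(1)[0][0]
            for key in keys
        }
        roles[role] = defaults
        # Whatever deviates from the default is an explicit, visible override.
        for name, cfg in members.items():
            overrides[name] = {
                k: v for k, v in cfg.items() if k != "role" and defaults.get(k) != v
            }
    return roles, overrides


roles, overrides = normalize(HOSTS)
print(roles)
print(overrides)
```

The shared settings now live in one place, and the one-off port on `web3` shows up as an explicit override. What the data alone cannot tell us is why it is there.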

Another symptom of this inherent mismatch between data and hierarchical text is the tendency to generate configuration files from other sources. While adding a level of indirection is the solution to all software engineering problems, we should always ask ourselves what we want to hide with this extra layer. In many cases it may be more efficient in the long run to fix the underlying problem and avoid the complexity of another layer.

We know how to model real objects and their attributes and relationships: Normalize the data to keep it small, clean, and efficient to use (by humans and machines). Encode requirements such as data dependencies in the schema so that invalid data cannot be stored. Visualize the schema and data to arrive at a shared understanding.
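As a minimal sketch of "encode requirements in the schema", the following uses Python's built-in `sqlite3` module. The `datacenter` and `host` tables, their columns, and the helper functions are assumptions made up for this example; a real infrastructure model would be much richer.

```python
import sqlite3

# Hypothetical two-table model: every host must belong to a known datacenter.
SCHEMA = """
CREATE TABLE datacenter (
    name   TEXT PRIMARY KEY,
    region TEXT NOT NULL
);
CREATE TABLE host (
    name       TEXT PRIMARY KEY,
    datacenter TEXT NOT NULL REFERENCES datacenter(name),
    cpu_cores  INTEGER NOT NULL CHECK (cpu_cores > 0)
);
"""


def open_inventory():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


def try_insert_host(conn, name, datacenter, cpu_cores):
    """Return True if the row was stored, False if the schema rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO host VALUES (?, ?, ?)", (name, datacenter, cpu_cores)
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_inventory()
with conn:
    conn.execute("INSERT INTO datacenter VALUES ('dc1', 'eu-west')")

print(try_insert_host(conn, "web1", "dc1", 8))  # valid row
print(try_insert_host(conn, "web2", "dc9", 8))  # unknown datacenter
print(try_insert_host(conn, "web3", "dc1", 0))  # violates the CHECK constraint
```

Only the first insert succeeds. The other two are rejected by the database itself, not by a linter or a review comment.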

I therefore suggest that we shift our perspective and think of infrastructure as data. Let's focus our energy on defining a shared model of infrastructure entities and their relationships instead of inventing yet another syntax for text files representing hierarchical objects. Let's draw some data diagrams to depict and discuss these models.

I don't have a favorite tool for storing and managing this infrastructure data, but the requirements are clear: We need most of the features of a "traditional" database plus support for versioning (including comparison of versions), auditing, and workflows (both before and after a change is committed). As in version control (and financial) systems, the data becomes immutable once committed and can only be changed by adding a new version. These requirements can be implemented with a general purpose database system, but there are also many specialized solutions to choose from.
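The versioning requirements can be sketched in a few lines of Python. The `VersionedStore` class below is hypothetical and deliberately naive (in-memory, flat key-value data, no concurrency). It only illustrates the idea of immutable, append-only versions with comparison and an audit trail, not how any particular product works.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Version:
    number: int
    author: str
    message: str
    data: Mapping[str, object]  # read-only view of the committed state


class VersionedStore:
    """Append-only store: committed versions are never modified."""

    def __init__(self):
        self._versions = []

    def commit(self, data, author, message):
        # Copy the input and wrap it read-only so later edits cannot leak in.
        # Assumes flat data; nested values would need a deep copy.
        snapshot = MappingProxyType(dict(data))
        version = Version(len(self._versions) + 1, author, message, snapshot)
        self._versions.append(version)
        return version

    def get(self, number):
        return self._versions[number - 1]

    def diff(self, old, new):
        """Map each changed key to its (old value, new value) pair."""
        a, b = self.get(old).data, self.get(new).data
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)
        }

    def history(self):
        """Audit trail: who committed which version, and why."""
        return [(v.number, v.author, v.message) for v in self._versions]


store = VersionedStore()
store.commit({"web1": "dc1"}, "alice", "initial inventory")
store.commit({"web1": "dc2", "web2": "dc1"}, "bob", "move web1, add web2")
print(store.diff(1, 2))
print(store.history())
```

Workflows before and after a commit would hook into `commit`, for example as validation callbacks and change notifications. They are left out here to keep the sketch short.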

But before worrying about the tools, let's think of infrastructure as data (IaD) rather than code.