CMake, the programming language

Andreas Hohmann March 31, 2024 #cmake #Linux #c++ #clang #bazel

I remember writing my own Python Makefile generator for C++ and Java projects back in the day and being delighted when Maven emerged as the de facto standard for Java builds and package management. Other language ecosystems also converged on standard tools such as sbt for Scala or Leiningen for Clojure. New programming languages must have a good build and package management story to be taken seriously. Rust's Cargo and Zig's build system are good examples of this trend.

C++ wasn't so lucky. Even 20 years later, and even though we now have a module system in the standard, we have not seen the same kind of convergence as in other programming ecosystems. C++ build tools still present a plethora of choices, each with its own shortcomings.

Gradle, the dominant build system in the Java/Kotlin/Android space, can build C++ projects, but its focus is clearly on JVM languages. Google's Bazel is growing fast, but from a small base, and it is struggling to move beyond its monorepo origins (adopting a new module system as a result) and to keep its own complexity under control. Its Starlark build language is now used by other, leaner build systems such as Meta's Buck2 and Thought Machine's Please. Zig touts its build system as a solution for C/C++ builds. I'm curious how much adoption these newer tools will see.

This leaves CMake as the closest thing to a standard C++ build tool. According to JetBrains' 2023 survey, about half of all C++ developers use it. Even when not using CMake for our own projects, we have to be able to read CMake files in order to understand other projects and libraries that we might want to use. The Envoy proxy, for example, uses Bazel but also depends on libraries built with CMake. Envoy uses a special CMake Bazel rule (from the rules_foreign_cc project) to compile CMake-based libraries as part of the Envoy build.

This post takes a closer look at the CMake programming language. As we will see, it is indeed a fully fledged programming language rather than a configuration language. This seems to distinguish platform- or language-specific build systems from general language-agnostic ones: a language-specific build system such as Cargo can live with a small configuration language like TOML, whereas general-purpose build systems tend towards complete programming languages. These build systems either use a DSL embedded in a general-purpose host language (Groovy or Kotlin in the case of Gradle, Scala in the case of sbt) or define their own language such as Bazel's Python-derived Starlark, Meson's build language, or the CMake language we are going to study here.

Curiously, CMake is still following the build file generation approach[^1], leaving the heavy lifting of the actual build execution to other, mostly platform-specific tools such as make, nmake, Visual Studio, and ninja.

First Impressions

At first sight, CMake configuration files look foreign. The main CMake file, for example, is called CMakeLists.txt. Why "Lists"? Why the plural? Why the ".txt" suffix instead of a suffix reflecting the CMake-specific syntax? Pushing these questions aside for a moment, the contents of the most basic CMakeLists.txt file are almost self-explanatory:

cmake_minimum_required(VERSION 3.22)
project(Hello CXX)
add_executable(Hello hello.cpp)

It compiles the minimal C++ "Hello World!" program as expected:

#include <iostream>

int main() {
  std::cout << "Hello World!" << std::endl;
}

The build is performed in two steps: generating the make files and executing them. Both steps can be performed with the cmake command:

cmake -B build
cmake --build build

The -B build option in the first command tells CMake to put all build artifacts including the generated make files into the build directory. This directory is called the build tree or build root. The --build option tells CMake to execute the build, and the following build argument is the build root again.

The minimal CMakeLists.txt file gives us a first glimpse of the "unusual" CMake syntax. Every statement looks like a function call with a function name followed by a (possibly empty) list of arguments enclosed in parentheses and separated by whitespace. Whitespace-separated arguments are not unusual, but they are usually combined with a function call syntax that drops the parentheses as well (as in shell and many functional languages). These CMake "function calls" are called commands, and a CMake program consists of a list of commands. A file containing CMake code is accordingly called a listfile. The normal suffix is now .cmake, but the main listfile has kept its CMakeLists.txt name. Packages, for example, are described by listfiles named <package-name>-config.cmake.

There are no assignment statements that one would normally expect from a build language (e.g., make variables, sbt settings), but the first line looks like we are setting the VERSION to 3.22.

What about a C++23 version of "Hello World"?

#include <print>

int main() {
  std::println("Hello World!");
}

Clang requires additional flags to use libc++ during the compile and link phases in order to compile this program:

clang-18 -std=c++23 -stdlib=libc++ -lc++ -o hello hello.cpp 

How does this translate to CMake? Setting the C++ standard version with the CMAKE_CXX_STANDARD variable is well documented:

set(CMAKE_CXX_STANDARD 23)

The C++ compiler can be configured by either setting the CXX environment variable (which CMake honors) or setting the CMAKE_CXX_COMPILER variable:

set(CMAKE_CXX_COMPILER "clang-18")

I was hoping that these settings would compel CMake to choose the necessary compiler and linker flags, but this was not the case. However, one can add these flags with add_compile_options and add_link_options. This leads to the following CMake equivalent of the plain clang-18 call above:

cmake_minimum_required(VERSION 3.22)

set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_COMPILER "clang-18")
add_compile_options("-stdlib=libc++")
add_link_options("-lc++")

project(Hello)
add_executable(Hello hello.cpp)

Note that this solution is not platform independent. I suspect that there is a better option to solve this problem.
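One possible refinement is sketched below. It is not a definitive answer, and it assumes that the compiler is selected outside the listfile (for example with the CXX environment variable mentioned above, or with -DCMAKE_CXX_COMPILER=clang-18 on the command line). The libc++ flags are attached only to the Hello target, and only when CMake reports the compiler as Clang. Passing -stdlib=libc++ to the link step as well lets the clang driver pick the matching library itself.

```cmake
cmake_minimum_required(VERSION 3.22)
project(Hello CXX)

# Request C++23 and fail at configure time if the compiler cannot provide it.
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(Hello hello.cpp)

# Only Clang understands -stdlib=libc++, so guard the flags by compiler id.
if(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
  target_compile_options(Hello PRIVATE -stdlib=libc++)
  target_link_options(Hello PRIVATE -stdlib=libc++)
endif()
```

With this variant, a call such as CXX=clang-18 cmake -B build selects the compiler, while other compilers simply skip the extra flags.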

Hello CMake

This very basic example of a C++ build definition gave us a first glimpse of the CMake language. It surely looks unusual and resembles neither a typical configuration language (JSON, TOML, and the like) nor any common programming language.

What makes this language with its unusual syntax tick? Let's start with the obligatory first program for any programming language and create a hello.cmake file with the following command:

message("Hello World!")

When running this CMake script with cmake -P hello.cmake, CMake will indeed print "Hello World!" to the console.

The second program in any programming language is often some arithmetic computation, possibly combined with a conditional statement checking the result. This is where CMake starts showing its character:

math(EXPR sum "1 + 1")
if(sum EQUAL 2)
  message("I can add, the sum is ${sum}")
else()
  message("I cannot add")
endif()

This program will print "I can add, the sum is 2" as expected, but why didn't we place the expression directly in the condition? Couldn't we write something like if(1 + 1 == 2) or at least if(math(1+1) EQUAL 2)?

CMake commands are pure statements, have no return value, and therefore cannot be used as arguments to other commands. There are no "expressions" in the usual sense. In order to use the result of the addition 1 + 1, we have to use the math command to evaluate the expression provided as a string and assign its result to a variable (here sum). After that, we can compare the value of this variable using the EQUAL operator, again provided as a separate argument. In general, we can only compute and use a value by assigning it to a variable (passed by name to the computing command) and then looking at this variable using string interpolation as in ${sum}.
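As a small illustration of this pattern (the variable names product and remainder are made up for this sketch), a computation with two steps needs one variable per intermediate result before the values can be used in a condition:

```cmake
# Each intermediate result needs its own variable.
math(EXPR product "6 * 7")
math(EXPR remainder "${product} % 5")

# if() has its own small condition syntax with operators such as
# GREATER, EQUAL, and AND; it is not a general expression evaluator.
if(product GREATER 40 AND remainder EQUAL 2)
  message("product is ${product}, remainder is ${remainder}")
endif()
```

Run with cmake -P, this prints "product is 42, remainder is 2".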

Functional programmers may be tempted to call CMake a pure non-functional language because it has no pure functions, whereas functional languages have only pure functions. But CMake is fully "functional" in the sense that it works (and is Turing-complete), so "pure side-effect language" may be the more appropriate characterization.

To get familiar with the other crucial aspect of the CMake language, let's walk through the following program which prints "There are 6 names":

set(names "Joe;Mary" John)                    # line 1
list(APPEND names "Lisa" Alice;Bob)           # line 2
set(foo name)                                 # line 3
set(bar oo)                                   # line 4
list(LENGTH "${f${bar}}s" names_length)       # line 5
message("There are ${names_length} names")    # line 6

To understand this program one has to know that all values in CMake are stored as strings. These strings are interpreted according to the context. A list, for example, is stored as a string formed by joining the list items and using a semicolon as a separator. If one of the items happens to include a semicolon, this semicolon becomes a separator. The set command takes the name of the variable to be set as the first argument and interprets the remaining arguments as a list that is stored in the variable. The command in line 1 therefore stores the string "Joe;Mary;John" in the variable names. Interpreted as a list, this string stores the three names "Joe", "Mary", and "John" because the semicolon between Joe and Mary becomes a separator.

The list command in line 2 uses its first argument, APPEND, as a subcommand and interprets the second argument as the name of a (list) variable. The APPEND subcommand mutates this list by appending one or more elements, again using the semicolon as a separator. We therefore end up with the list string "Joe;Mary;John;Lisa;Alice;Bob" which can be interpreted as a list of 6 elements.
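The following sketch makes the "a list is just a string" point tangible by applying both list and string commands to the same variable. The variable names count, chars, and pretty are chosen only for this illustration.

```cmake
set(names Joe Mary John)

# The stored value is a single string with semicolon separators.
message("${names}")                          # Joe;Mary;John

# The list command sees three items ...
list(LENGTH names count)
message("items: ${count}")                   # items: 3

# ... while the string command sees 13 characters, separators included.
string(LENGTH "${names}" chars)
message("characters: ${chars}")              # characters: 13

# Since it is a string, ordinary string operations work on the separators.
string(REPLACE ";" ", " pretty "${names}")
message("${pretty}")                         # Joe, Mary, John
```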

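Note that we sometimes use double quotes to enclose a string argument and sometimes don't. In these examples it does not make a difference, but it matters once the string contains whitespace or semicolons: an unquoted argument is split at whitespace and semicolons into several arguments, whereas a quoted argument is always passed as a single argument.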
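The effect of quoting on argument splitting can be observed with a tiny helper. This is only a sketch: count_args is a hypothetical function (functions are covered later in this post) that reports how many arguments it received through the ARGC variable that CMake defines inside every function.

```cmake
# Hypothetical helper: prints the number of arguments it was called with.
function(count_args)
  message("${ARGC} argument(s)")
endfunction()

count_args(Joe Mary)      # 2 argument(s): whitespace separates arguments
count_args("Joe Mary")    # 1 argument(s): quotes keep the whitespace
count_args(Joe;Mary)      # 2 argument(s): unquoted semicolons split as well
count_args("Joe;Mary")    # 1 argument(s): a single argument holding a list
```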

But what about "${f${bar}}s"? CMake performs recursive string interpolation, using the common ${...} placeholder syntax, when evaluating arguments. The variable bar is set to oo. The string "f${bar}" therefore evaluates to "foo" so that "${f${bar}}s" becomes "${foo}s". But the variable foo has the value "name" so that the whole string "${f${bar}}s" evaluates to "names" which is the variable name ultimately passed to the LENGTH subcommand. The LENGTH subcommand of the list command correctly determines that the list in the "names" variable has length 6 and stores this value in the names_length variable.

Note that an unquoted string is interpreted as a variable name or as a plain string depending on the position.

set(x 10)
set(y x)
message("x: ${x}")
message("y: ${y}")
set("${y}" foo)
message("x: ${x}")

This program prints

x: 10
y: x
x: foo

These simple examples reveal some common patterns of CMake commands. Arguments can be interpreted as values, keywords, or variable names. Keywords are by convention written in upper case. They control the meaning of the following arguments. All string arguments (whether they are interpreted as variable names, values, or keywords) are subject to string interpolation.

The following example demonstrates the string interpolation for all three types of arguments and prints "There are 3 names":

set(names "Joe;Mary" John)                    # line 1
set(var names)                                # line 2
set(cmd LENGTH)                               # line 3
set(len "${var}_length")                      # line 4
list(${cmd} ${var} ${len})                    # line 5
message("There are ${names_length} names")    # line 6

Note in particular that the name names_length of the variable that the length is stored in is constructed dynamically in line 4 by concatenating names (stored in var) and _length.

Hence, we could characterize CMake as a pure side-effect string-oriented language.

One more thing to keep in mind:

Truthiness is determined differently depending on whether the condition is an unquoted variable name or not. An unquoted variable is considered false if and only if it is undefined or its value is empty, OFF, NO, N, FALSE, IGNORE, NOTFOUND, 0, or ends with -NOTFOUND. Any other value, such as a quoted string, is considered true if and only if it is ON, YES, Y, TRUE, or a non-zero number. The named constants are case-insensitive.
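A short script can make these rules concrete. It is a sketch that assumes CMake 3.22 or later; the cmake_minimum_required call at the top is there so that the current policy for quoted arguments in if() applies, and the variable names are invented for the example.

```cmake
cmake_minimum_required(VERSION 3.22)  # selects the current if() policies

set(flag OFF)
set(lib "LIB-NOTFOUND")
set(word "hello")

if(NOT flag)
  message("flag is false: OFF is a false constant")
endif()
if(NOT lib)
  message("lib is false: its value ends with -NOTFOUND")
endif()
if(word)
  message("word is true: its value is not a false constant")
endif()
if(NOT "hello")
  message("\"hello\" is false: a quoted string that is not a true constant")
endif()
if(NOT undefined_variable)
  message("an undefined variable is false")
endif()
```

All five messages are printed, which shows the asymmetry: the variable word holding "hello" is true, while the quoted string "hello" is false.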

Loops

In comparison to the features we have seen so far, loops look almost conventional. The foreach command takes the iteration variable as the first argument, followed by additional arguments describing the range or lists to loop through. The most common foreach loop iterates through the items of a list:

set(names "Joe" "Mary" "John")
foreach(name IN LISTS names)
  message("name: ${name}")
endforeach()

It is possible to specify multiple lists and additional items. foreach will loop through all the elements as if the lists were concatenated and the items appended.

set(names "Joe" "Mary" "John")
set(cities "Boston" "Berlin" "Barcelona")
foreach(item IN LISTS names cities ITEMS "foo" "bar")
  message("item: ${item}")
endforeach()

This program prints

item: Joe
item: Mary
item: John
item: Boston
item: Berlin
item: Barcelona
item: foo
item: bar

A range loop takes RANGE as the second argument followed by the optional start and the end index (both inclusive). If the start index is missing, the loop starts at 0. The following example prints the numbers from 1 to 5 followed by the numbers from 0 to 4:

foreach(i RANGE 1 5)
  message("i: ${i}")
endforeach()
foreach(i RANGE 4)
  message("i: ${i}")
endforeach()
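The RANGE form also accepts an optional third number as a step. The following sketch counts from 0 to 10 in steps of 5:

```cmake
# foreach(<var> RANGE <start> <stop> <step>): the stop value is inclusive.
foreach(i RANGE 0 10 5)
  message("i: ${i}")    # prints i: 0, i: 5, i: 10
endforeach()
```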

There is even a "zip" syntax to loop through multiple lists in parallel:

set(names "Joe" "Mary" "John")
list(LENGTH names len)
foreach(i RANGE 1 ${len})
  list(APPEND indexes ${i})
endforeach()
foreach(item IN ZIP_LISTS indexes names)
  message("name ${item_0}: ${item_1}")
endforeach()

This program prints

name 1: Joe
name 2: Mary
name 3: John
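Instead of the generated item_0 and item_1 variables, ZIP_LISTS also accepts one loop variable per list, which tends to read more naturally. The sketch below assumes CMake 3.17 or later and uses a made-up cities list:

```cmake
set(names "Joe" "Mary" "John")
set(cities "Boston" "Berlin" "Barcelona")

# One loop variable per zipped list, in the same order as the lists.
foreach(name city IN ZIP_LISTS names cities)
  message("${name} lives in ${city}")
endforeach()
```

This prints "Joe lives in Boston", "Mary lives in Berlin", and "John lives in Barcelona".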

As an alternative, we could use a while loop:

set(i 0)
while(i LESS len)
  list(GET names ${i} name)
  math(EXPR i "${i} + 1")
  message("name ${i}: ${name}")
endwhile()

Both foreach and while loops also support break and continue.
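A minimal sketch of both commands: the loop below skips odd numbers with continue() and leaves the loop with break() once the counter exceeds 6. The variable name rem is chosen only for this example.

```cmake
foreach(i RANGE 1 10)
  math(EXPR rem "${i} % 2")
  if(rem EQUAL 1)
    continue()          # odd number: go on with the next iteration
  endif()
  if(i GREATER 6)
    break()             # leave the loop entirely
  endif()
  message("even: ${i}") # prints even: 2, even: 4, even: 6
endforeach()
```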

Functions

CMake offers two ways to define reusable code blocks: macros and functions. Macros perform textual substitution similar to C preprocessor macros, whereas functions resemble functions in other programming languages with their own scope and a return command to return the program flow back to the caller.

function(print_list items)
  foreach(item IN LISTS items)
    message("item: ${item}")
  endforeach()
endfunction()

set(names "Joe" "Mary" "John")
print_list("${names}")

Note that the list has to be passed as an interpolated string. Calling print_list(names) interprets names as a string (a list containing the single string item "names"), and calling print_list(${names}) passes three (space-separated) arguments to the print_list function, which interprets the first argument as a list containing the single item "Joe".
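Two alternative calling conventions are sketched below; the function names print_list_by_name and print_items are hypothetical. The first receives the name of the list variable and dereferences it inside the function, the second collects all arguments through the ARGN variable that CMake defines for the arguments beyond the declared parameters.

```cmake
# Variant 1: pass the list by name. IN LISTS expects variable names,
# so ${list_name} expands to the caller's variable name (here: names).
function(print_list_by_name list_name)
  foreach(item IN LISTS ${list_name})
    message("item: ${item}")
  endforeach()
endfunction()

# Variant 2: no declared parameters; every argument ends up in ARGN.
function(print_items)
  foreach(item IN LISTS ARGN)
    message("item: ${item}")
  endforeach()
endfunction()

set(names "Joe" "Mary" "John")
print_list_by_name(names)   # prints the three names
print_items(${names})       # prints the three names as well
```

The by-name variant has a caveat: if the caller's variable happens to be called list_name or item, it is shadowed by the function's own variables.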

Functions have their own variable frame or "scope". Logically one can imagine that CMake copies the dictionary of variables before calling a function and lets the function operate on this new dictionary when setting or updating variables. When reading a variable, CMake will first look in the current scope. If it cannot find the variable name in the current scope, it looks in the parent scope, that is, the scope of the caller. This process continues until the variable is found or there is no parent scope. If a variable is not found, an empty string is used silently. CMake does not raise an error for a missing variable.

set(x 10)
function(bar y)
  message("bar x: ${x}")
  set(x ${y})
  message("bar x: ${x}")
endfunction()

function(foo)
  message("foo x: ${x}")
  set(x 30)
  bar(20)
  message("foo x: ${x}")
endfunction()

message("main x: ${x}")
foo()
message("main x: ${x}")

This program prints

main x: 10
foo x: 10
bar x: 30
bar x: 20
foo x: 30
main x: 10
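A macro, by contrast, does not get a scope of its own. The following sketch (with made-up names) sets the same variable once from a function and once from a macro; only the macro's assignment is visible to the caller.

```cmake
function(set_in_function)
  set(value "set by function")   # changes the function's own scope only
endfunction()

macro(set_in_macro)
  set(value "set by macro")      # runs in the caller's scope
endmacro()

set(value "original")
set_in_function()
message("after function: ${value}")   # after function: original
set_in_macro()
message("after macro: ${value}")      # after macro: set by macro
```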

A function can manipulate variables in the scope of its caller, the parent scope, by adding the PARENT_SCOPE keyword to the set command. The following function doubles the value stored in a variable.

function(double var)
  math(EXPR result "2 * ${${var}}")
  set(${var} ${result} PARENT_SCOPE)
endfunction()

set(x 10)
double(x)
message("double(x) = ${x}")

This program prints

double(x) = 20

In a build script (CMakeLists.txt), the variable lookup also considers the so-called Cache which is a set of configuration variables that can be set separately (from the CMake UI or the command line).
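As a rough sketch of how the cache interacts with normal variables, consider the following CMakeLists.txt. GREETING and CacheDemo are hypothetical names chosen for this example; the set(... CACHE ...) form and the $CACHE{...} reference syntax are regular CMake features.

```cmake
cmake_minimum_required(VERSION 3.22)
project(CacheDemo NONE)   # NONE: no compiler is needed for this demo

# The default is stored only if the cache has no GREETING entry yet.
set(GREETING "Hello" CACHE STRING "Greeting printed at configure time")
message("${GREETING} from the cache")

# A normal variable with the same name shadows the cache entry ...
set(GREETING "Hi")
message("${GREETING} from a normal variable")

# ... but the cache entry can still be read explicitly.
message("$CACHE{GREETING} is still stored in the cache")
```

Configuring with cmake -B build -DGREETING=Howdy overrides the default from the command line, and the value stays in the cache for later runs.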

Conclusion

The combination of these features is surprisingly flexible and powerful.

CMake does not have a built-in dictionary structure, for example, but we can easily build one ourselves. Variables are stored in some internal dictionary and we can create variable names dynamically. This is all we need to implement an API that looks like a dictionary.

function(map_set map key value)
  set("${map}_${key}" ${value} PARENT_SCOPE)
endfunction()

function(map_get map key result)
  set(${result} "${${map}_${key}}" PARENT_SCOPE)
endfunction()

map_set(prices apple 1.50)
map_set(prices pear 1.75)
map_get(prices apple price)
message("An apple costs ${price}")

Of course we are using global variables and name mangling with all their dangers, but it's possible to implement arbitrarily complex data structures this way.
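To illustrate how such a structure might grow, here is one possible extension, not the only way to do it: this variant of map_set additionally records the keys in a companion list variable named <map>_keys (a naming convention invented for this sketch), which makes it possible to iterate over the map.

```cmake
cmake_minimum_required(VERSION 3.22)  # needed for the IN_LIST operator

# Variant of map_set that also maintains a list of keys in <map>_keys.
function(map_set map key value)
  set("${map}_${key}" "${value}" PARENT_SCOPE)
  set(keys ${${map}_keys})            # current key list (empty at first)
  if(NOT key IN_LIST keys)
    list(APPEND keys "${key}")
  endif()
  set("${map}_keys" "${keys}" PARENT_SCOPE)
endfunction()

map_set(prices apple 1.50)
map_set(prices pear 1.75)
map_set(prices apple 1.60)            # updates the value, key stays unique

foreach(key IN LISTS prices_keys)
  message("${key}: ${prices_${key}}") # apple: 1.60, then pear: 1.75
endforeach()
```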

I hope that this short overview gives you an idea of the language that underpins the CMake build system. Knowing this language and its quirks not only explains the patterns in the CMake build language, but will also allow you to create your own extensions.

[^1] Google's Blaze/Bazel build tool also dates back to a Python Makefile generator, which explains the syntax of Bazel's Starlark (formerly Skylark) language.