CMake, the programming language
Andreas Hohmann March 31, 2024 #cmake #Linux #c++ #clang #bazel

I remember writing my own Python Makefile generator for C++ and Java projects back in the day and being delighted when Maven emerged as the de-facto standard for Java builds and package management. Other language ecosystems also converged on standard tools such as sbt for Scala or Leiningen for Clojure. New programming languages must have a good build and package management story to be taken seriously. Rust's Cargo and Zig's build system are good examples of this trend.
C++ wasn't so lucky. Even 20 years later, and even though we now have a module system in the standard, we have not seen the same kind of convergence as in other programming ecosystems, and C++ build tools still present a plethora of choices, each with its own shortcomings.
Gradle, the dominant build system in the Java/Kotlin/Android space, can build C++ projects, but its focus is clearly on JVM languages. Google's Bazel is growing fast, but from a small base, and it is struggling to move beyond its monorepo origins (adopting a new module system as a result) and to keep its own complexity under control. Its Starlark build language is now used by other, leaner build systems such as Meta's Buck2 and Thought Machine's Please. Zig touts its build system as a solution for C/C++ builds. I'm curious how much adoption these newer tools will see.
This leaves CMake as the closest thing to a standard C++ build tool. According to JetBrains' 2023 survey, about half of all C++ developers use it. Even when not using CMake for our own projects, we have to be able to read CMake files in order to understand other projects and libraries that we might want to use. The Envoy proxy, for example, uses Bazel but also depends on libraries built with CMake. Envoy uses a special CMake Bazel rule (from the rules_foreign_cc project) to compile CMake-based libraries as part of the Envoy build.
This post takes a closer look at the CMake programming language. As we will see, it's indeed a fully fledged programming language rather than a configuration language. This seems to distinguish platform- or language-specific build systems from general language-agnostic ones: A language-specific build system such as Cargo can live with a small configuration language like TOML, whereas general-purpose build systems tend towards complete programming languages. These build systems either use a DSL embedded in a general-purpose host language (Groovy or Kotlin in the case of Gradle, Scala in the case of sbt) or define their own language such as Bazel's Python-derived Starlark, Meson's build language, or the CMake language we are going to study here.
Curiously, CMake still follows the build file generation approach[^1], leaving the heavy lifting of the actual build execution to other, mostly platform-specific tools such as make, nmake, Visual Studio, and ninja.
First Impressions
At first sight, CMake configuration files look foreign. The main CMake file, for example, is called CMakeLists.txt. Why "Lists"? Why the plural? Why the ".txt" suffix instead of a suffix reflecting the CMake-specific syntax? Pushing these questions aside for a moment, the content of the most basic CMakeLists.txt file is almost self-explanatory:
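A minimal sketch of such a file might look as follows. The first line matches the version discussed further down; the project and target name hello and the source file hello.cpp are assumptions based on the compiler call shown later in this post.

```cmake
# Oldest CMake version this build file is written for.
cmake_minimum_required(VERSION 3.22)

# Name the project and declare that it uses C++.
project(hello CXX)

# Build an executable called "hello" from a single source file.
add_executable(hello hello.cpp)
```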
It compiles the minimal C++ "Hello World!" program as expected:
#include <iostream>
int main() { std::cout << "Hello World!\n"; }
The build is performed in two steps: generating the make files and executing them. Both steps can be performed with the cmake command:
cmake -B build
cmake --build build
The -B build option in the first command tells CMake to put all build artifacts including the generated make files into the build directory. This directory is called the build tree or build root. The --build option tells CMake to execute the build, and the following build argument is the build root again.
The minimal CMakeLists.txt file gives us a first glimpse of the "unusual" CMake syntax. Every statement looks like a function call with a function name followed by a (possibly empty) list of arguments enclosed in parentheses and separated by whitespace. Whitespace-separated arguments are not unusual, but they are usually combined with a function call syntax that drops the parentheses as well (as in shell and many functional languages). These CMake "function calls" are called commands, and a CMake program consists of a list of commands. A file containing CMake code is accordingly called a listfile. The normal suffix is now .cmake, but the main listfile has kept its CMakeLists.txt name. Packages, for example, are described by listfiles named <package-name>-config.cmake.
There are no assignment statements that one would normally expect from a build language (e.g., make variables, sbt settings), but the first line looks like we are setting the VERSION to 3.22.
What about a C++23 version of "Hello World"?
#include <print>
int main() { std::println("Hello World!"); }
Clang requires additional flags that select libc++ during the compile and link phases to build this program:
clang-18 -std=c++23 -stdlib=libc++ -lc++ -o hello hello.cpp
How does this translate to CMake? Setting the C++ version with the CMAKE_CXX_STANDARD variable is well documented:
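As a sketch, the setting could look like this. The second line is my addition rather than something the build strictly needs: it asks CMake to fail instead of silently falling back to an older standard.

```cmake
# Request C++23 for all targets defined after this point.
set(CMAKE_CXX_STANDARD 23)

# Optional (an addition for illustration): treat the standard as a hard requirement.
set(CMAKE_CXX_STANDARD_REQUIRED ON)
```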
The C++ compiler can be configured by either setting the CXX environment variable (which CMake honors) or setting the CMAKE_CXX_COMPILER variable:
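A sketch of the variable-based option follows. The executable name clang++-18 is an assumption (it is the name Debian-style packages use for the C++ driver); adjust it to whatever your system provides.

```cmake
# Select the compiler before the project() command, which is where CMake
# detects and tests the compiler. The executable name is an assumption.
set(CMAKE_CXX_COMPILER clang++-18)

# Environment alternative, run from the shell instead of editing the file:
#   CXX=clang++-18 cmake -B build
```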
I was hoping that these settings would compel CMake to choose the necessary compiler and linker flags, but this was not the case. However, one can add these flags with add_compile_options and add_link_options. This leads to the following CMake equivalent of the plain clang-18 call above:
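Putting the pieces together, the build file might look like the sketch below. The compiler executable name and the project and target names are assumptions, and the flags are specific to Clang.

```cmake
cmake_minimum_required(VERSION 3.22)

# Assumed compiler executable name; must be set before project().
set(CMAKE_CXX_COMPILER clang++-18)

project(hello CXX)

set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Clang-specific: compile and link against libc++.
add_compile_options(-stdlib=libc++)
add_link_options(-stdlib=libc++)

add_executable(hello hello.cpp)
```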
Note that this solution is not platform independent. I suspect that there is a better option to solve this problem.
Hello CMake
This very basic example of a C++ build definition gave us a first glimpse of the CMake language. It surely looks unusual and resembles neither a typical configuration language (JSON, TOML, and the like) nor any common programming language.
What makes this language with its unusual syntax tick? Let's start with the obligatory first program for any programming language and create a hello.cmake file with the following command:
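The message command is CMake's way of printing text, so the whole script is presumably a single line along these lines:

```cmake
# hello.cmake: print a greeting to the console
message("Hello World!")
```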
When running this CMake script with cmake -P hello.cmake, CMake will indeed print "Hello World!" to the console.
The second program in any programming language is often some arithmetic computation, possibly combined with a conditional statement checking the result. This is where CMake starts showing its character:
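A sketch of such a program, using the math command and the variable sum discussed in the next paragraphs:

```cmake
# Evaluate the expression given as a string and store the result in "sum".
math(EXPR sum "1 + 1")

# Compare the variable's value numerically with 2.
if(sum EQUAL 2)
  message("I can add, the sum is ${sum}")
endif()
```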
This program will print "I can add, the sum is 2" as expected, but why didn't we place the expression directly in the condition? Couldn't we write something like if(1 + 1 == 2) or at least if(math(1+1) EQUAL 2)?
CMake commands are pure statements, have no return value, and therefore cannot be used as arguments to other commands. There are no "expressions" in the usual sense. In order to use the result of the addition 1 + 1, we have to use the math command to evaluate the expression provided as a string and assign its result to a variable (here sum). After that, we can compare the value of this variable using the EQUAL operator, again provided as a separate argument.
In general, we can only compute and use a value by assigning it to a variable (passed by name to the computing command) and then looking at this variable using string interpolation as in ${sum}.
Functional programmers may be tempted to call CMake a pure non-functional language because it has no pure functions whereas functional languages have only pure functions. But CMake is fully "functional" in the sense that it works (and is Turing-complete), so "pure side-effect language" may be the more appropriate characterization.
To get familiar with the other crucial aspect of the CMake language, let's walk through the following program which prints "There are 6 names":
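The listing below is a reconstruction based on the walkthrough that follows, so the exact spelling of individual arguments is an assumption. The trailing comments carry the line numbers the text refers to.

```cmake
set(names "Joe;Mary" John)                     # line 1
list(APPEND names Lisa "Alice;Bob")            # line 2
set(foo name)                                  # line 3
set(bar oo)                                    # line 4
list(LENGTH "${f${bar}}s" names_length)        # line 5
message("There are ${names_length} names")     # line 6
```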
To understand this program one has to know that all values in CMake are stored as strings. These strings are interpreted according to the context. A list, for example, is stored as a string formed by joining the list items and using a semicolon as a separator. If one of the items happens to include a semicolon, this semicolon becomes a separator. The set command takes the name of the variable to be set as the first argument and interprets the remaining arguments as a list that is stored in the variable. The command in line 1 therefore stores the string "Joe;Mary;John" in the variable names. Interpreted as a list, this string stores the three names "Joe", "Mary", and "John" because the semicolon between Joe and Mary becomes a separator.
The list command in line 2 uses its first argument, APPEND, as a subcommand and interprets the second argument as the name of a (list) variable. The APPEND subcommand mutates this list by appending one or more elements, again using the semicolon as a separator. We therefore end up with the list string "Joe;Mary;John;Lisa;Alice;Bob" which can be interpreted as a list of 6 elements.
Note that we sometimes use double quotes to enclose a string argument and sometimes don't. It does not make a difference unless the string contains whitespace or semicolons, both of which split an unquoted argument into several arguments.
But what about "${f${bar}}s"? CMake performs recursive string interpolation, using the common ${...} placeholder syntax, when evaluating arguments. The variable bar is set to oo. The string "f${bar}" therefore evaluates to "foo" so that "${f${bar}}s" becomes "${foo}s". But the variable foo has the value "name" so that the whole string "${f${bar}}s" evaluates to "names" which is the variable name ultimately passed to the LENGTH subcommand. The LENGTH subcommand of the list command correctly determines that the list in the "names" variable has length 6 and stores this value in the names_length variable.
Note that an unquoted string is interpreted as a variable name or as a plain string depending on the position.
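The following sketch is one program consistent with the output shown next; it is my reconstruction, not necessarily the original listing. In the set command, the first argument is a variable name and the second a plain string, even when both are spelled x.

```cmake
set(x 10)              # first argument: variable name, second: plain string
message("x: ${x}")

set(y x)               # here the unquoted x is just the string "x"
message("y: ${y}")

set(${y} foo)          # ${y} expands to "x", which is then used as a variable name
message("x: ${x}")
```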
x: 10
y: x
x: foo
These simple examples reveal some common patterns of CMake commands. Arguments can be interpreted as values, keywords, or variable names. Keywords are by convention written in uppercase. They control the meaning of the following arguments. All string arguments (whether they are interpreted as variable names, values, or keywords) are subject to string interpolation.
The following example demonstrates the string interpolation for all three types of arguments and prints "There are 3 names":
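Again, the listing is a reconstruction: the helper variable names subcommand and count are assumptions, while var, names, and names_length come from the text. The trailing comments carry the line numbers.

```cmake
set(var names)                                  # line 1
set(subcommand LENGTH)                          # line 2
set(${var} Joe Mary John)                       # line 3: variable name is interpolated
list(${subcommand} ${var} ${var}_length)        # line 4: keyword and both names are interpolated
set(count "${${var}_length}")                   # line 5: value is interpolated
message("There are ${count} names")             # line 6
```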
Note in particular that the name names_length of the variable that the length is stored in is constructed dynamically in line 4 by concatenating names (stored in var) and _length.
Hence, we could characterize CMake as a pure side-effect string-oriented language.
A few more things to keep in mind:
- The command names are case insensitive, but the convention is to use lowercase.
- Variable names are case sensitive.
- Conditions in if and while commands support the logical operators NOT, AND, and OR and grouping with parentheses.
- String comparisons must use the STR prefix, for example, STRLESS.
- MATCHES matches a string (left-hand side) against a regular expression (right-hand side).
- Conditions support many special operations such as EXISTS, IS_DIRECTORY, and IS_NEWER_THAN.
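A small sketch combining several of these condition features; the variable name and its value are made up for illustration. It can be run as a script with cmake -P.

```cmake
cmake_minimum_required(VERSION 3.22)

set(name "hello.cmake")

# Regular expression match, logical operators, grouping, and a string comparison.
if(name MATCHES "\\.cmake$" AND NOT (name STRLESS "a"))
  message("${name} ends in .cmake and does not sort before a")
endif()

# File system checks on the script being executed and its directory.
if(EXISTS "${CMAKE_CURRENT_LIST_FILE}" AND IS_DIRECTORY "${CMAKE_CURRENT_LIST_DIR}")
  message("this script and its directory exist")
endif()
```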
Truthiness is determined differently depending on whether the argument is an unquoted variable name or not. An unquoted variable is considered false if and only if it is undefined or its value is empty, OFF, NO, N, FALSE, IGNORE, NOTFOUND, 0, or ends with -NOTFOUND. Any other argument, such as a quoted string, is considered true if and only if it is ON, YES, Y, TRUE, or a non-zero number.
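The difference is easiest to see side by side. The following sketch uses made-up variable names; the cmake_minimum_required line is there so that quoted strings are not looked up as variables (policy CMP0054).

```cmake
cmake_minimum_required(VERSION 3.22)

set(lib "foo-NOTFOUND")
set(word "hello")

if(NOT lib)
  message("lib is false: its value ends with -NOTFOUND")
endif()

if(word)
  message("word is true: the variable holds a value that is not a false constant")
endif()

if(NOT "hello")
  message("the quoted string hello is false: it is not one of the true constants")
endif()

if("YES")
  message("the quoted constant YES is true")
endif()
```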
Loops
In comparison to the features we have seen so far, loops look almost conventional. The foreach command takes the iteration variable as the first argument followed by additional arguments describing the range or lists to loop through. The most common foreach loop iterates through the items of a list:
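A sketch of this basic form, using the IN LISTS keyword with a list variable called names (reused from the earlier examples):

```cmake
set(names Joe Mary John)

# "names" is passed by name; foreach reads the list stored in it.
foreach(item IN LISTS names)
  message("item: ${item}")
endforeach()
```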
It is possible to specify multiple lists and additional items. foreach will loop through all the elements as if the lists were concatenated and the items appended.
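The following sketch is a reconstruction that matches the output listed next; the list variable names are assumptions.

```cmake
set(names Joe Mary John)
set(cities Boston Berlin Barcelona)

# Two lists (passed by name) followed by two literal items.
foreach(item IN LISTS names cities ITEMS foo bar)
  message("item: ${item}")
endforeach()
```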
item: Joe
item: Mary
item: John
item: Boston
item: Berlin
item: Barcelona
item: foo
item: bar
A range loop takes RANGE as the second argument followed by the optional start and the end index (both inclusive). If the start index is missing, the loop starts at 0. The following example prints the numbers from 1 to 5 followed by the numbers from 0 to 4:
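A sketch of both forms (the loop variable name i is arbitrary):

```cmake
# Start and end index given: 1, 2, 3, 4, 5
foreach(i RANGE 1 5)
  message("${i}")
endforeach()

# Only the end index given: 0, 1, 2, 3, 4
foreach(i RANGE 4)
  message("${i}")
endforeach()
```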
There is even a "zip" syntax to loop through multiple lists in parallel:
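The sketch below uses the ZIP_LISTS keyword, which requires CMake 3.17 or newer. With a single loop variable (here pair, a made-up name), the current items are available as pair_0, pair_1, and so on. The list names are assumptions as well.

```cmake
set(indices 1 2 3)
set(names Joe Mary John)

# Walk both lists in lockstep.
foreach(pair IN ZIP_LISTS indices names)
  message("name ${pair_0}: ${pair_1}")
endforeach()
```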
This program prints
name 1: Joe
name 2: Mary
name 3: John
As an alternative, we could use a while loop:
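One way to write it, as a sketch with assumed variable names, is to keep an explicit index and fetch each item with list(GET ...):

```cmake
set(names Joe Mary John)
list(LENGTH names count)

set(i 0)
while(i LESS count)
  list(GET names ${i} name)      # zero-based access
  math(EXPR i "${i} + 1")        # advance the index; also the 1-based position
  message("name ${i}: ${name}")
endwhile()
```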
foreach and while loops also support break and continue.
Functions
CMake offers two ways to define reusable code blocks: macros and functions. Macros perform textual substitution similar to C preprocessor macros, whereas functions resemble functions in other programming languages with their own scope and a return command to return the program flow back to the caller.
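As a sketch, here is a function print_list (the name is taken from the discussion below; the body and the parameter name items are my assumptions) together with a call:

```cmake
# Print every item of the list passed as the single argument.
function(print_list items)
  foreach(item IN LISTS items)
    message("item: ${item}")
  endforeach()
endfunction()

set(names Joe Mary John)
print_list("${names}")   # passes the string "Joe;Mary;John" as one argument
```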
Note that the list has to be passed as an interpolated string. Calling print_list(names) interprets names as a string (a list containing the single string item "names"), and calling print_list(${names}) passes three (space-separated) arguments to the print_list function, which interprets the first argument as a list containing the single item "Joe".
Functions have their own variable frame or "scope". Logically one can imagine that CMake copies the dictionary of variables before calling a function and lets the function operate on this new dictionary when setting or updating variables. When reading a variable, CMake will first look in the current scope. If it cannot find the variable name in the current scope, it looks in the parent scope, that is, the scope of the caller. This process continues until the variable is found or there is no parent scope. If a variable is not found, an empty string is used silently. CMake does not raise an error for a missing variable.
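The following sketch is one program consistent with the output shown next; the function names foo and bar come from that output, the bodies are my reconstruction.

```cmake
function(bar)
  message("bar x: ${x}")   # sees the value set by the caller foo
  set(x 20)                # changes only bar's own scope
  message("bar x: ${x}")
endfunction()

function(foo)
  message("foo x: ${x}")   # sees the value from the top-level scope
  set(x 30)                # changes only foo's own scope
  bar()
  message("foo x: ${x}")   # unaffected by the set() inside bar
endfunction()

set(x 10)
message("main x: ${x}")
foo()
message("main x: ${x}")    # unaffected by the set() inside foo
```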
main x: 10
foo x: 10
bar x: 30
bar x: 20
foo x: 30
main x: 10
A function can manipulate variables in the scope of its caller, the parent scope, by adding the PARENT_SCOPE keyword to the set command. The following function doubles the value stored in a variable.
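A sketch of such a function; the parameter name var and the local variable result are assumptions.

```cmake
# Double the number stored in the caller's variable whose name is given in "var".
function(double var)
  math(EXPR result "${${var}} * 2")        # read the current value via double interpolation
  set(${var} ${result} PARENT_SCOPE)       # write the result back into the caller's scope
endfunction()

set(x 10)
double(x)
message("double(x) = ${x}")
```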
double(x) = 20
In a build script (CMakeLists.txt), the variable lookup also considers the so-called Cache, which is a set of configuration variables that can be set separately (from the CMake UI or the command line).
Conclusion
The combination of these features is surprisingly flexible and powerful.
CMake does not have a built-in dictionary structure, for example, but we can easily build one ourselves. Variables are stored in some internal dictionary and we can create variable names dynamically. This is all we need to implement an API that looks like a dictionary.
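Here is a minimal sketch of the idea. The function names dict_set and dict_get and the naming scheme <dict>_<key> are made up for illustration, and this version stores the entries in the caller's scope rather than in truly global state.

```cmake
# Store "value" under "key" by writing a variable named <dict>_<key> in the caller's scope.
function(dict_set dict key value)
  set(${dict}_${key} "${value}" PARENT_SCOPE)
endfunction()

# Look up "key" and store the result in the caller's variable named by "out".
function(dict_get dict key out)
  set(${out} "${${dict}_${key}}" PARENT_SCOPE)
endfunction()

dict_set(ages Joe 42)
dict_set(ages Mary 37)
dict_get(ages Joe age)
message("Joe is ${age}")
```

A missing key simply yields an empty string, in line with CMake's silent handling of undefined variables.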
Of course we are using global variables and name mangling with all their dangers, but it's possible to implement arbitrarily complex data structures this way.
I hope that this short overview gives you an idea of the language that underpins the CMake build system. Knowing this language and its quirks not only explains the patterns in the CMake build language, but will also allow you to create your own extensions.
[^1] Google's Blaze/Bazel build tool also dates back to a Python Makefile generator, which explains the syntax of Bazel's Starlark language (formerly called Skylark).