Integrating a flexible code generator into CMake

published: 26 February 2018

There’s a problem with CMake and code generators that comes up time and again. I’m talking about the kind of code generators where you know their list of output files only after they are done executing. The generally agreed upon rumour at my place of work and on the web is that integrating such a generator into CMake is a huge pain.

It turns out, the rumour isn’t quite accurate. And here’s why.

Code generators and the CMake build process

From a CMake-centric point of view there are two types of code generators:

Those that produce a predefined, fixed list of output files. Let’s call them fixed generators.
The interesting ones where you know the list of output files only after they run. Let’s call them flexible generators.

Fixed generators are easy. Throw add_custom_target() or add_custom_command() at them and you’re done. For flexible generators the important questions are when and how you obtain the list of output files.

Let’s quickly recap the CMake build process. CMake is a meta build system. It does not build software itself, but generates configuration files for an actual build system like Make or Ninja. This is the configuration phase. Afterwards you invoke the build system to actually build the software. This is the build phase. The complete list of input files for the build phase – source code, resource files, documentation sources, etc. – must be known before the build starts.

That brings us back to flexible code generators. Obviously they cannot run in the build phase. You’d get the list of output files too late. That also means add_custom_target() and add_custom_command() aren’t useful, because the commands they define run in the build phase.

A flexible code generator must run in the configuration phase. Additionally, you need to set up dependencies on the generator’s input files and on the implementation files of the generator itself. Otherwise, changing any of those files and simply calling your build tool again would not trigger a re-run of the CMake configuration, and with it the generator.

I tried all the following approaches with Make and Ninja. CMake’s IDE project generators (e.g. the one for Visual Studio) work quite differently, so you might run into trouble with those.

Setting up a CMake project for a flexible code generator

Imagine a C++ project with a code generator implemented in Python that takes a bunch of XML files as input (the model) and produces C++ source code (.hpp and .cpp files) as its output.

The main CMakeLists.txt file starts as usual:

cmake_minimum_required(VERSION 3.10)
project(flexible-codegen-demo CXX)

I tested all of this with CMake 3.10 on an Arch Linux machine. All the CMake commands used are available in all older CMake 3.x versions as well, but not in 2.x.

Tracking dependencies

CMake has a directory property called CMAKE_CONFIGURE_DEPENDS. It’s a list of files that act as dependencies for the configuration phase. If one of these files changes and you run your build tool again, a reconfigure is triggered automatically. This property is the key to getting the same change detection behaviour as for targets in the build phase.

The relevant part of the CMakeLists.txt looks as follows:

set(codegen_impl
    codegen/codegen.py
    codegen/xml_reader.py
    codegen/cpp_writer.py
)

set(codegen_model
    model/miaow.xml
    model/mooo.xml
    model/wuff.xml
)

set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS
    ${codegen_impl} ${codegen_model}
)

In everyday usage you treat the two set() statements the same as any other file list in an add_executable() or add_library(). They are just a bunch more source files.

Note the APPEND when setting the property. It makes sure existing configure dependencies for this directory are not overwritten.

Obtaining the list of generated files

This is a bit trickier than dependency tracking because the best approach depends on your code generator. I can think of three ways to obtain the list of generated files:

Globbing a directory tree reserved for generated files.
Storing the file list in an intermediate file.
Piping the file list from the generator’s stdout directly into a CMake variable.

To run the generator all of these methods use execute_process(), CMake’s command to run an external process in the configuration phase. For the fictitious generator in this post assume that it takes an output directory with the -o option as well as a list of model input files.
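
Because the generator now runs at configure time, it’s probably worth stopping the configure run when the generator fails. Here’s a minimal sketch, assuming the fictitious generator reports failure through a non-zero exit code. The variable name codegen_result is made up for this example.

```cmake
# Assumption: codegen.py exits with a non-zero status on failure.
execute_process(
    COMMAND python codegen/codegen.py
            -o ${CMAKE_BINARY_DIR}/generated ${codegen_model}
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    # Receives the exit code, or an error string if the process
    # could not be started at all.
    RESULT_VARIABLE codegen_result
)

# EQUAL is false for non-numeric strings, so a failed launch
# ends up in this branch as well.
if(NOT codegen_result EQUAL 0)
    message(FATAL_ERROR "Code generator failed: ${codegen_result}")
endif()
```

Without such a check, a broken generator run might go unnoticed until the build phase trips over missing or outdated files.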

If you don’t have enough control over the output of your code generator and need to do post-processing to get a suitable file list, consider using an external scripting language instead of doing everything in CMake. The CMake language is quite powerful but can get painful quickly. It’s perfectly legitimate to have either multiple consecutive execute_process() commands or even multiple COMMAND sections inside a single execute_process(). In the latter case CMake runs the commands concurrently as a pipeline, with each command’s stdout piped into the next one’s stdin. See the CMake documentation for details.

Globbing

This approach is simple, pragmatic and often enough all you need. You generate and then glob the output directory tree to get the list of generated files.

Globbing is done with CMake’s file(GLOB) or file(GLOB_RECURSE). The usual advice against using file globbing does not apply in this case. It’s discouraged for normal source file lists because added files aren’t tracked automatically and you are forced to re-run CMake manually. But the list of generated files is entirely controlled by the generator implementation and model – and those files are tracked explicitly.

The relevant part of CMakeLists.txt looks like this:

execute_process(
    COMMAND python codegen/codegen.py
            -o ${CMAKE_BINARY_DIR}/generated ${codegen_model}
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
)

file(GLOB_RECURSE generated_files
     ${CMAKE_BINARY_DIR}/generated/*.hpp
     ${CMAKE_BINARY_DIR}/generated/*.cpp)

The two commands are executed one after the other. First the generator runs and creates files in the subdirectory generated of the build directory. Then file() globs recursively over that directory and stores the paths to all .hpp and .cpp files in the variable generated_files.

CMake does not need to know about the header files to build the software. But they do not hurt either and some IDEs only show them if they are part of a target.
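
To use the globbed list in a target, something along these lines should do. It’s only a sketch: it assumes the hand-written sources include the generated headers by paths relative to the generated directory, which the fictitious generator in this post doesn’t promise.

```cmake
# Assumption: main.cpp includes generated headers relative to
# ${CMAKE_BINARY_DIR}/generated, e.g. #include "cat.hpp".
add_executable(${PROJECT_NAME}
    ${generated_files}
    main.cpp
)

# The generated headers live in the build tree, so the compiler
# needs to be told where to look for them.
target_include_directories(${PROJECT_NAME}
    PRIVATE ${CMAKE_BINARY_DIR}/generated
)
```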

Intermediate file

Your code generator creates files anyway, doesn’t it? So it’s not too much of a stretch to imagine that you might have a way to generate a file containing a list of all the generated files’ paths. And if that’s possible, why not generate a snippet of CMake code that puts that list into a variable? Then you can include() that snippet and have immediate access to the file list. Yep, at configuration time CMake can dynamically create part of its own configuration.

The relevant part of CMakeLists.txt looks like this:

execute_process(
    COMMAND python codegen/codegen.py -o . ${codegen_model}
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
)

include(generated_files.cmake)

The two statements are executed strictly sequentially. CMake does not look for the include file until the generator has finished executing, and so picks up the fresh list of source files.

This could be generated_files.cmake created by the code generator:

set(generated_files
    src/cat.hpp src/cat.cpp
    src/cow.hpp src/cow.cpp
    src/dog.hpp src/dog.cpp
)

Remember the rules of CMake’s include mechanism. The included file becomes part of the including file’s scope. Most importantly that means relative paths in generated_files.cmake are relative to the location of CMakeLists.txt.
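
If the generator writes into a directory other than the one containing CMakeLists.txt, those relative paths stop working. One way around that is to anchor every path at CMAKE_CURRENT_LIST_DIR, which inside an included file holds the directory of the included file itself. The sketch below assumes the generator writes generated_files.cmake into the root of its output directory.

```cmake
# Hypothetical generated_files.cmake, assumed to sit in the root of
# the generator's output directory. CMAKE_CURRENT_LIST_DIR expands to
# the directory of this file, not of the including CMakeLists.txt.
set(generated_files
    ${CMAKE_CURRENT_LIST_DIR}/src/cat.hpp
    ${CMAKE_CURRENT_LIST_DIR}/src/cat.cpp
    ${CMAKE_CURRENT_LIST_DIR}/src/cow.hpp
    ${CMAKE_CURRENT_LIST_DIR}/src/cow.cpp
    ${CMAKE_CURRENT_LIST_DIR}/src/dog.hpp
    ${CMAKE_CURRENT_LIST_DIR}/src/dog.cpp
)
```

The include() call then needs the matching path, for example include(${CMAKE_BINARY_DIR}/generated/generated_files.cmake) if the generator was run with -o ${CMAKE_BINARY_DIR}/generated.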

Piping via stdout

This is a variation of using an intermediate file, just without the file. It pipes the generator’s stdout channel directly into a CMake variable. The output must use CMake’s list format, i.e. strings separated by semicolons.

Overall I like the intermediate file approach better. With piping you need absolute control over what is output on stdout. Nothing other than file paths and semicolon separators is allowed. Depending on your setup that could be hard to guarantee.

Anyway, here’s the snippet from CMakeLists.txt:

execute_process(
    COMMAND python codegen/codegen.py -o . ${codegen_model}
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    OUTPUT_VARIABLE generated_files
    OUTPUT_STRIP_TRAILING_WHITESPACE
)

No, there is no option for stripping leading whitespace …
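
Leading whitespace can still be removed after the fact with string(STRIP), which trims both ends of a string. The sketch below also turns newline-separated output into a CMake list, for a generator that prints one path per line instead of semicolons. That output format is an assumption, and the set() at the top merely stands in for the result of execute_process().

```cmake
# Stand-in for what OUTPUT_VARIABLE might contain (assumed format:
# one path per line, with stray whitespace around the whole output).
set(generated_files "\n  src/cat.cpp\nsrc/cow.cpp\nsrc/dog.cpp\n")

# string(STRIP) removes leading and trailing whitespace in one go.
string(STRIP "${generated_files}" generated_files)

# Newlines become semicolons, i.e. the string becomes a CMake list.
string(REPLACE "\n" ";" generated_files "${generated_files}")

foreach(path IN LISTS generated_files)
    message(STATUS "generated: ${path}")
endforeach()
```

Note that this only trims the output as a whole. Whitespace around individual lines, or log messages mixed into stdout, would still end up in the list.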

Putting it all together

All methods provide the list of generated files in the variable generated_files. The complete CMakeLists.txt could look like this:

cmake_minimum_required(VERSION 3.10)
project(flexible-codegen-demo CXX)

set(codegen_impl
    codegen/codegen.py
    codegen/xml_reader.py
    codegen/cpp_writer.py
)
set(codegen_model
    model/miaow.xml
    model/mooo.xml
    model/wuff.xml
)
set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS
    ${codegen_impl} ${codegen_model}
)

execute_process(
    COMMAND python codegen/codegen.py -o . ${codegen_model}
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
)
include(generated_files.cmake)

add_executable(${PROJECT_NAME}
    ${generated_files}
    main.cpp
    # ... other source files ...
)

I still need to play around with this some more. But so far things are working fine. If you start a clean build with the usual incantation

mkdir build && cd build
cmake .. && make && ./flexible-codegen-demo

CMake configures itself and runs the generator. Then my little demo project builds successfully and executes correctly. Also CMake apparently sets up the dependencies as expected. For example, if you touch one of the model files and call

make

an incremental CMake reconfigure runs, followed by an incremental rebuild – just as expected. Touching one of the generated files triggers a rebuild without a reconfigure – also as expected.

Caveats

All of the above was only a toy example. Here are some possible stumbling blocks, or details to flesh out in a real implementation.

Obsolete files need to be detected reliably. These are files that were generated in a previous run but no longer are. A good generator should take care of deleting them itself, but it’s still a point to watch out for.
What about incremental generation? If only one out of a hundred model files changes, you probably want to save time and not run the complete generator again. But for the non-globbing approaches you still need a complete list of output files. So there must be a cache involved somewhere. Can the generator do it? Otherwise using a CMake CACHE variable might be a solution.
Do flexible generators in subprojects cause any problems? A bunch of add_subdirectory() calls isn’t exactly uncommon in a real-life project.
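
For the first point, the clean-up could also happen on the CMake side if the generator doesn’t do it. Here’s a rough sketch, assuming that all generator output lives below ${CMAKE_BINARY_DIR}/generated and that generated_files contains absolute paths spelled exactly the way file(GLOB_RECURSE) reports them. The variable name stale_files is made up.

```cmake
# Everything currently on disk in the directory reserved for
# generated code (assumption: nothing else is stored there).
file(GLOB_RECURSE stale_files
     ${CMAKE_BINARY_DIR}/generated/*.hpp
     ${CMAKE_BINARY_DIR}/generated/*.cpp)

# Whatever remains after removing the outputs of the current run
# is left over from an earlier run. The comparison is purely
# string-based, hence the assumption about path spelling.
if(stale_files AND generated_files)
    list(REMOVE_ITEM stale_files ${generated_files})
endif()

if(stale_files)
    message(STATUS "Removing obsolete generated files: ${stale_files}")
    file(REMOVE ${stale_files})
endif()
```

This would have to run after the generator and after generated_files has been filled, but before the list is handed to add_executable().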

I’ll leave it at that for this post. These or similar points are going to crop up once I throw this at a real project; and I’ll dive into some or all of them then.