255 lines
11 KiB
Markdown
255 lines
11 KiB
Markdown
ScyllaDB's long build time has been a problem almost since its inception -
|
|
our issue #1 - https://github.com/scylladb/scylladb/issues/1 - is about it.
|
|
To reduce Scylla's compilation time, it's important to understand what is
|
|
causing it.
|
|
|
|
In the old days of C compilers, the compilation time was mostly
|
|
proportional to the amount of source code. The only way to reduce compilation
|
|
time without deleting source code was to reduce the amount of unnecessary
|
|
header-file inclusion. However, in modern template-heavy C++ code, such
|
|
as Seastar and ScyllaDB, not all header files slow down the build by
|
|
the same amount. Template instantiation and code generation takes orders
|
|
of magnitude more time than just parsing - and it's possible that one
|
|
header file adds a millisecond to the build time, while another header
|
|
file of the same length adds a full second.
|
|
|
|
Clang has a compile-time option, "-ftime-trace", for tracing the
|
|
compilation time of a single source file, to find which parts of the
|
|
compilation took most time. This document is about how to use this
|
|
option together with a tool `ClangBuildAnalyzer` - which aggregates the
|
|
`-ftime-trace` logs from all the source files to decide which header files
|
|
and templates are slowing down build the most.
|
|
|
|
# Installing ClangBuildAnalyzer
|
|
|
|
Get ClangBuildAnalyzer:
|
|
```
|
|
git clone https://github.com/aras-p/ClangBuildAnalyzer
|
|
```
|
|
Build it with `make -f projects/make/Makefile`, and install the resulting
|
|
`build/ClangBuildAnalyzer`.
|
|
|
|
# Getting started with ClangBuildAnalyzer
|
|
In your Scylla working directory, start with a fresh "build" directory
|
|
to avoid measurement of already-deleted source files or irrelevant build
|
|
modes:
|
|
```
|
|
rm -r build
|
|
```
|
|
Next, rebuild build.ninja to add `-ftime-trace` to the compilation commands.
|
|
Run your favorite `configure.py` command line, adding `--cflags=-ftime-trace`.
|
|
For example:
|
|
|
|
```
|
|
./configure.py --disable-dpdk --cflags=-ftime-trace
|
|
```
|
|
|
|
Now build the main `scylla` executable, in one build mode. For example,
|
|
```
|
|
ninja build/dev/scylla
|
|
```
|
|
|
|
This build will create in `build/dev`, for each source file compiled, a
|
|
**separate** JSON file with the compile-time analysis of it. For example,
|
|
for `cdc/generation.cc`, a file `build/dev/cdc/generation.json` will be
|
|
created.
|
|
|
|
These JSON files use Chrome's trace format, which can be viewed individually
|
|
in various tools including Google's https://ui.perfetto.dev/.
|
|
The ClangBuildAnalyzer tool which we installed above allows us to combine
|
|
the information from the separate source files to find the worst **overall**
|
|
offenders. Run it as:
|
|
|
|
```
|
|
ClangBuildAnalyzer --all build/dev build/dev/clba.bin
|
|
ClangBuildAnalyzer --analyze build/dev/clba.bin | less
|
|
```
|
|
|
|
The `-all` command merges all the ".json" files in build/dev into one
|
|
merged binary file `build/dev/clba.bin`. The "--analyze" step generates
|
|
a human-visible output. The next section is about understanding this output,
|
|
but before that, some comments on how to iteratively use ClangBuildAnalyzer
|
|
after changing the code to see how compilation performance changed:
|
|
|
|
* Because adding "-ftime-trace" changes the command line, the first "ninja"
|
|
build described above will need to rebuild everything, even if ccache
|
|
was used to already build this executable. Afterwards, ccache unfortunately
|
|
avoids any caching while the "-ftime-trace" option is in use (this is
|
|
documented in https://ccache.dev/releasenotes.html, only stating that
|
|
it is "too hard").
|
|
* After changing the source code, you only need to rerun the "ninja" and
|
|
"ClangBuildAnalyzer" steps above. The "ninja" will only recompile the
|
|
source files that need recompiling, these source files' ".json" output
|
|
files will change, and the "ClangBuildAnalyzer" step will use the new
|
|
json files - in addition to those that didn't change - to build the new
|
|
up-to-date statistics.
|
|
* If a source file is removed from the project, its json file will stay
|
|
behind in build/dev/ and add wrong information to ClangBuildAnalyzer.
|
|
So remove that json file manually, or clean everything up by removing
|
|
the entire "build" directory and compiling again.
|
|
* If you no longer want "ninja" to measure build time and write the json
|
|
file, re-run configure.py without the `--cflags=-ftime-trace`.
|
|
|
|
After making a change to the source code, it is sometimes hard to measure
|
|
the small improvement to build time because build time measurement isn't
|
|
very accurate. We noticed in the past that the build time is highly
|
|
correlated with the total produced object size:
|
|
```
|
|
du -bc build/dev/**/*.o
|
|
```
|
|
I.e., the C++ compiler is usually slow because it generates a huge amount
|
|
of code. So this `du` output can be used as a good estimate of how much
|
|
build time we saved by a change to the code.
|
|
|
|
# Using ClangBuildAnalyzer output
|
|
|
|
This section is about how to make use of the output of
|
|
```
|
|
ClangBuildAnalyzer --analyze build/dev/clba.bin
|
|
```
|
|
|
|
The first two sections of the output - "Files that took longest to parse"
|
|
and "Files that took longest to codegen", list the source files (`*.cc`)
|
|
which took the longest to build. As of this writing, the top offender
|
|
is `messaging_service.cc` taking 84 seconds to parse and 135 second to
|
|
compile:
|
|
```
|
|
**** Files that took longest to parse (compiler frontend):
|
|
84164 ms: build/dev/message/messaging_service.o
|
|
...
|
|
**** Files that took longest to codegen (compiler backend):
|
|
135417 ms: build/dev/message/messaging_service.o
|
|
```
|
|
|
|
There is little to be done with this sort of information - it doesn't
|
|
teach us _why_ this one source file, only 1,376 lines of code, takes
|
|
almost 4 minutes of CPU to compile. It might have some use, for finding
|
|
the slowest-to-build files to build them first (to avoid stragglers in
|
|
a parallel build) or to split them to even smaller files.
|
|
|
|
The next section, is more interesting:
|
|
```
|
|
**** Templates that took longest to instantiate:
|
|
385738 ms: fmt::detail::vformat_to<char> (552 times, avg 698 ms)
|
|
197016 ms: boost::basic_regex<char>::assign (329 times, avg 598 ms)
|
|
195522 ms: ser::deserialize<seastar::simple_memory_input_stream> (1617 times, av
|
|
g 120 ms)
|
|
...
|
|
```
|
|
|
|
The top templates in this list have the unfortunate distinction of being
|
|
used in many source files (often by appearing in a header file used in
|
|
many places), and taking a very long time (as much as 0.7 seconds in the
|
|
above example) to compile each time it is used. We can save a lot of
|
|
compilation time if we reduce the number of times each is used (perhaps
|
|
by avoiding it in a popular header file, or by using this header file
|
|
less), change the template to use fewer templates inside it, or use
|
|
"extern template" to only instantiate it once.
|
|
|
|
The next section,
|
|
```
|
|
*** Template sets that took longest to instantiate:
|
|
640682 ms: std::unique_ptr<$> (31372 times, avg 20 ms)
|
|
581171 ms: std::__and_<$> (351772 times, avg 1 ms)
|
|
|
|
```
|
|
Is probably less useful directly. We can perhaps hope that other changes
|
|
will reduce the number of time that `std::unique_ptr<>` needs to be
|
|
instantiated, but can't do anything directly to improve it.
|
|
|
|
In the next section, we see the slowest to compile functions:
|
|
```
|
|
**** Functions that took longest to compile:
|
|
2899 ms: cql3_parser::CqlParser::cqlStatement() (build/dev/gen/cql3/CqlParser.cpp)
|
|
2144 ms: replica::database::setup_metrics() (replica/database.cc)
|
|
```
|
|
Maybe we can take a look at some of these functions to figure out why they
|
|
are so slow to compile - e.g., a short function like alternator::stats::stats()
|
|
taking a full second to compile because of inefficient Boost options tricks.
|
|
But in general, looking at this section is probably not worth the effort,
|
|
because each of these functions is only compiled **once**, not hundreds of
|
|
times as in the template instantiations above, so even if we improve some of
|
|
them, the saving will be small.
|
|
|
|
The next section is similar, but for template functions (so can be compiled
|
|
multiple times):
|
|
|
|
```
|
|
*** Function sets that took longest to compile / optimize:
|
|
57092 ms: fmt::v9::appender fmt::v9::detail::write_int_noinline<$>(fmt::v9::appender, fmt::v9::detail::write_int_arg<$>, fmt::v9::basic_format_specs<$> const&, fmt::v9::detail::locale_ref) (1071 times, avg 53 ms)
|
|
44361 ms: fmt::v9::detail::format_dragon(fmt::v9::detail::basic_fp<$>, unsigned int, int, fmt::v9::detail::buffer<$>&, int&) (357 times, avg 124 ms)
|
|
```
|
|
|
|
The next section is one of the most interesting ones, about the most
|
|
expensive header files:
|
|
|
|
```
|
|
1086240 ms: replica/database.hh (included 139 times, avg 7814 ms), included via:
|
|
47x: query_processor.hh wasm.hh
|
|
46x: <direct include>
|
|
14x: schema_registry.hh
|
|
6x: user_function.hh wasm.hh
|
|
5x: cql_test_env.hh
|
|
4x: column_family.hh
|
|
3x: wasm.hh
|
|
3x: storage_service.hh tablet_allocator.hh
|
|
3x: server.hh query_processor.hh wasm.hh
|
|
2x: storage_service.hh data_listeners.hh schema_registry.hh
|
|
1x: test_services.hh
|
|
...
|
|
```
|
|
|
|
Here we see that a whopping 1086 seconds - that's about 6% of the total
|
|
build time - was spent compiling replica/database.hh 139 times - almost
|
|
8 seconds each. We should make an effort to include database.hh less,
|
|
split it, make it do less. Note also how out of the 139 source files
|
|
that included database.hh, only 46 included it directly - others included
|
|
it through `wasm.hh` and `query_processor.hh` - are those indirect
|
|
includes necessary?
|
|
|
|
It's worth noting that the cost of 1086 seconds is an **overestimation**.
|
|
In practice, deleting this header file will not reduce 1086 seconds of the
|
|
compilation time. The reason is that database.hh probably includes other
|
|
headers and instantiates templates which get "billed" to it because this
|
|
header happened, by chance, to be the first to use them in this compilation
|
|
unit; But other headers use those headers and templates too, so they would
|
|
have been instantiated anyway even if database.hh got deleted. And it's
|
|
hard to know this without trying.
|
|
|
|
# ClangBuildAnalyzer.ini
|
|
As show above, ClangBuildAnalyzer makes lists of various types (slowest
|
|
source files, slowest templates, etc.) and each of these show a fixed
|
|
number of top items. This number, and other things, can be configured in a
|
|
file ClangBuildAnalyzer.ini. See an example of how to configure it in
|
|
https://github.com/aras-p/ClangBuildAnalyzer/blob/main/ClangBuildAnalyzer.ini
|
|
|
|
Here is another example - the configuration file that I use:
|
|
```
|
|
[counts]
|
|
|
|
# files that took most time to parse
|
|
fileParse = 20
|
|
# files that took most time to generate code for
|
|
fileCodegen = 20
|
|
# functions that took most time to generate code for
|
|
function = 30
|
|
# header files that were most expensive to include
|
|
header = 1000
|
|
# for each expensive header, this many include paths to it are shown
|
|
headerChain = 10
|
|
# templates that took longest to instantiate
|
|
template = 50
|
|
|
|
# Minimum times (in ms) for things to be recorded into trace
|
|
[minTimes]
|
|
# parse/codegen for a file
|
|
file = 1
|
|
|
|
[misc]
|
|
# Maximum length of symbol names printed; longer names will get truncated
|
|
maxNameLength = 270
|
|
# Only print "root" headers in expensive header report, i.e.
|
|
# only headers that are directly included by at least one source file
|
|
onlyRootHeaders = false
|
|
```
|