Files
scylladb/docs/dev/compilation-time-analysis.md
TripleChecker e72e6fadeb Fix typos
2025-02-11 00:17:43 +02:00

255 lines
11 KiB
Markdown

ScyllaDB's long build time has been a problem almost since its inception -
our issue #1 - https://github.com/scylladb/scylladb/issues/1 - is about it.
To reduce Scylla's compilation time, it's important to understand what is
causing it.
In the old days of C compilers, the compilation time was mostly
proportional to the amount of source code. The only way to reduce compilation
time without deleting source code was to reduce the amount of unnecessary
header-file inclusion. However, in modern template-heavy C++ code, such
as Seastar and ScyllaDB, not all header files slow down the build by
the same amount. Template instantiation and code generation takes orders
of magnitude more time than just parsing - and it's possible that one
header file adds a millisecond to the build time, while another header
file of the same length adds a full second.
Clang has a compile-time option, "-ftime-trace", for tracing the
compilation time of a single source file, to find which parts of the
compilation took most time. This document is about how to use this
option together with a tool `ClangBuildAnalyzer` - which aggregates the
`-ftime-trace` logs from all the source files to decide which header files
and templates are slowing down build the most.
# Installing ClangBuildAnalyzer
Get ClangBuildAnalyzer:
```
git clone https://github.com/aras-p/ClangBuildAnalyzer
```
Build it with `make -f projects/make/Makefile`, and install the resulting
`build/ClangBuildAnalyzer`.
# Getting started with ClangBuildAnalyzer
In your Scylla working directory, start with a fresh "build" directory
to avoid measurement of already-deleted source files or irrelevant build
modes:
```
rm -r build
```
Next, rebuild build.ninja to add `-ftime-trace` to the compilation commands.
Run your favorite `configure.py` command line, adding `--cflags=-ftime-trace`.
For example:
```
./configure.py --disable-dpdk --cflags=-ftime-trace
```
Now build the main `scylla` executable, in one build mode. For example,
```
ninja build/dev/scylla
```
This build will create in `build/dev`, for each source file compiled, a
**separate** JSON file with the compile-time analysis of it. For example,
for `cdc/generation.cc`, a file `build/dev/cdc/generation.json` will be
created.
These JSON files use Chrome's trace format, which can be viewed individually
in various tools including Google's https://ui.perfetto.dev/.
The ClangBuildAnalyzer tool which we installed above allows us to combine
the information from the separate source files to find the worst **overall**
offenders. Run it as:
```
ClangBuildAnalyzer --all build/dev build/dev/clba.bin
ClangBuildAnalyzer --analyze build/dev/clba.bin | less
```
The `-all` command merges all the ".json" files in build/dev into one
merged binary file `build/dev/clba.bin`. The "--analyze" step generates
a human-visible output. The next section is about understanding this output,
but before that, some comments on how to iteratively use ClangBuildAnalyzer
after changing the code to see how compilation performance changed:
* Because adding "-ftime-trace" changes the command line, the first "ninja"
build described above will need to rebuild everything, even if ccache
was used to already build this executable. Afterwards, ccache unfortunately
avoids any caching while the "-ftime-trace" option is in use (this is
documented in https://ccache.dev/releasenotes.html, only stating that
it is "too hard").
* After changing the source code, you only need to rerun the "ninja" and
"ClangBuildAnalyzer" steps above. The "ninja" will only recompile the
source files that need recompiling, these source files' ".json" output
files will change, and the "ClangBuildAnalyzer" step will use the new
json files - in addition to those that didn't change - to build the new
up-to-date statistics.
* If a source file is removed from the project, its json file will stay
behind in build/dev/ and add wrong information to ClangBuildAnalyzer.
So remove that json file manually, or clean everything up by removing
the entire "build" directory and compiling again.
* If you no longer want "ninja" to measure build time and write the json
file, re-run configure.py without the `--cflags=-ftime-trace`.
After making a change to the source code, it is sometimes hard to measure
the small improvement to build time because build time measurement isn't
very accurate. We noticed in the past that the build time is highly
correlated with the total produced object size:
```
du -bc build/dev/**/*.o
```
I.e., the C++ compiler is usually slow because it generates a huge amount
of code. So this `du` output can be used as a good estimate of how much
build time we saved by a change to the code.
# Using ClangBuildAnalyzer output
This section is about how to make use of the output of
```
ClangBuildAnalyzer --analyze build/dev/clba.bin
```
The first two sections of the output - "Files that took longest to parse"
and "Files that took longest to codegen", list the source files (`*.cc`)
which took the longest to build. As of this writing, the top offender
is `messaging_service.cc` taking 84 seconds to parse and 135 second to
compile:
```
**** Files that took longest to parse (compiler frontend):
84164 ms: build/dev/message/messaging_service.o
...
**** Files that took longest to codegen (compiler backend):
135417 ms: build/dev/message/messaging_service.o
```
There is little to be done with this sort of information - it doesn't
teach us _why_ this one source file, only 1,376 lines of code, takes
almost 4 minutes of CPU to compile. It might have some use, for finding
the slowest-to-build files to build them first (to avoid stragglers in
a parallel build) or to split them to even smaller files.
The next section, is more interesting:
```
**** Templates that took longest to instantiate:
385738 ms: fmt::detail::vformat_to<char> (552 times, avg 698 ms)
197016 ms: boost::basic_regex<char>::assign (329 times, avg 598 ms)
195522 ms: ser::deserialize<seastar::simple_memory_input_stream> (1617 times, av
g 120 ms)
...
```
The top templates in this list have the unfortunate distinction of being
used in many source files (often by appearing in a header file used in
many places), and taking a very long time (as much as 0.7 seconds in the
above example) to compile each time it is used. We can save a lot of
compilation time if we reduce the number of times each is used (perhaps
by avoiding it in a popular header file, or by using this header file
less), change the template to use fewer templates inside it, or use
"extern template" to only instantiate it once.
The next section,
```
*** Template sets that took longest to instantiate:
640682 ms: std::unique_ptr<$> (31372 times, avg 20 ms)
581171 ms: std::__and_<$> (351772 times, avg 1 ms)
```
Is probably less useful directly. We can perhaps hope that other changes
will reduce the number of time that `std::unique_ptr<>` needs to be
instantiated, but can't do anything directly to improve it.
In the next section, we see the slowest to compile functions:
```
**** Functions that took longest to compile:
2899 ms: cql3_parser::CqlParser::cqlStatement() (build/dev/gen/cql3/CqlParser.cpp)
2144 ms: replica::database::setup_metrics() (replica/database.cc)
```
Maybe we can take a look at some of these functions to figure out why they
are so slow to compile - e.g., a short function like alternator::stats::stats()
taking a full second to compile because of inefficient Boost options tricks.
But in general, looking at this section is probably not worth the effort,
because each of these functions is only compiled **once**, not hundreds of
times as in the template instantiations above, so even if we improve some of
them, the saving will be small.
The next section is similar, but for template functions (so can be compiled
multiple times):
```
*** Function sets that took longest to compile / optimize:
57092 ms: fmt::v9::appender fmt::v9::detail::write_int_noinline<$>(fmt::v9::appender, fmt::v9::detail::write_int_arg<$>, fmt::v9::basic_format_specs<$> const&, fmt::v9::detail::locale_ref) (1071 times, avg 53 ms)
44361 ms: fmt::v9::detail::format_dragon(fmt::v9::detail::basic_fp<$>, unsigned int, int, fmt::v9::detail::buffer<$>&, int&) (357 times, avg 124 ms)
```
The next section is one of the most interesting ones, about the most
expensive header files:
```
1086240 ms: replica/database.hh (included 139 times, avg 7814 ms), included via:
47x: query_processor.hh wasm.hh
46x: <direct include>
14x: schema_registry.hh
6x: user_function.hh wasm.hh
5x: cql_test_env.hh
4x: column_family.hh
3x: wasm.hh
3x: storage_service.hh tablet_allocator.hh
3x: server.hh query_processor.hh wasm.hh
2x: storage_service.hh data_listeners.hh schema_registry.hh
1x: test_services.hh
...
```
Here we see that a whopping 1086 seconds - that's about 6% of the total
build time - was spent compiling replica/database.hh 139 times - almost
8 seconds each. We should make an effort to include database.hh less,
split it, make it do less. Note also how out of the 139 source files
that included database.hh, only 46 included it directly - others included
it through `wasm.hh` and `query_processor.hh` - are those indirect
includes necessary?
It's worth noting that the cost of 1086 seconds is an **overestimation**.
In practice, deleting this header file will not reduce 1086 seconds of the
compilation time. The reason is that database.hh probably includes other
headers and instantiates templates which get "billed" to it because this
header happened, by chance, to be the first to use them in this compilation
unit; But other headers use those headers and templates too, so they would
have been instantiated anyway even if database.hh got deleted. And it's
hard to know this without trying.
# ClangBuildAnalyzer.ini
As show above, ClangBuildAnalyzer makes lists of various types (slowest
source files, slowest templates, etc.) and each of these show a fixed
number of top items. This number, and other things, can be configured in a
file ClangBuildAnalyzer.ini. See an example of how to configure it in
https://github.com/aras-p/ClangBuildAnalyzer/blob/main/ClangBuildAnalyzer.ini
Here is another example - the configuration file that I use:
```
[counts]
# files that took most time to parse
fileParse = 20
# files that took most time to generate code for
fileCodegen = 20
# functions that took most time to generate code for
function = 30
# header files that were most expensive to include
header = 1000
# for each expensive header, this many include paths to it are shown
headerChain = 10
# templates that took longest to instantiate
template = 50
# Minimum times (in ms) for things to be recorded into trace
[minTimes]
# parse/codegen for a file
file = 1
[misc]
# Maximum length of symbol names printed; longer names will get truncated
maxNameLength = 270
# Only print "root" headers in expensive header report, i.e.
# only headers that are directly included by at least one source file
onlyRootHeaders = false
```