Adjust some of the existing tests in service_level_controller_test.cc
and add some more in order to test the workload prioritization features,
i.e. the service level shares.
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
Introduce the "SHARES" keyword which can be used in conjunction with
existing CQL statements related to the service levels.
Adjust the CQL statements for service levels:
- CREATE/ALTER now allow setting shares (only if the cluster is fully
upgraded); see the sketch after this list
- LIST EFFECTIVE SERVICE LEVEL now returns the number of shares in a new
column
- LIST SERVICE LEVEL(S) also returns the number of shares, and has an
additional column, "percentage of all service level shares"
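A minimal sketch of how these statements could be exercised from a
cql_test_env-based unit test; the service level name, share values and test
structure are illustrative only, not taken from the patch:
```
// A sketch using Scylla's cql_test_env test helpers; names and values are
// illustrative only.
#include "test/lib/cql_test_env.hh"

seastar::future<> test_service_level_shares_cql() {
    return do_with_cql_env_thread([](cql_test_env& e) {
        // Allowed only once the whole cluster supports workload prioritization.
        e.execute_cql("CREATE SERVICE LEVEL sl_interactive WITH shares = 800").get();
        e.execute_cql("ALTER SERVICE LEVEL sl_interactive WITH shares = 500").get();
        // Both listings now include the shares; LIST ALL SERVICE LEVELS also
        // reports the "percentage of all service level shares" column.
        e.execute_cql("LIST ALL SERVICE LEVELS").get();
    });
}
```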
Now, when the user logs in and the connection becomes authenticated, the
processing loop of the connection is switched to the scheduling group
that corresponds to the service level assigned to the logged-in user.
The scheduling group is also updated when the service level assigned to
this user changes.
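Schematically, the switch can be pictured as below; seastar::with_scheduling_group()
is real Seastar API, while the surrounding function, arguments and includes are
placeholders for the actual transport code:
```
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/util/noncopyable_function.hh>

// Sketch only: `sl_group` would be obtained from the service level controller
// for the authenticated role, and `serve_requests` stands in for the real
// per-connection processing loop.
seastar::future<> serve_authenticated_connection(
        seastar::scheduling_group sl_group,
        seastar::noncopyable_function<seastar::future<>()> serve_requests) {
    // From here on, the processing loop runs with the shares of the user's
    // service level; when the user's service level changes, the loop is
    // re-entered with the new scheduling group.
    return seastar::with_scheduling_group(sl_group, std::move(serve_requests));
}
```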
Starting from this commit, the scheduling groups managed by the service
level controller are actually used by user workloads.
In order to make sure that the scheduling group is preserved across RPC,
and also to prevent priority-inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level when serving user traffic.
The above is achieved by reusing the existing concept of "tenants" in
the messaging service: when a new service level (or, more accurately,
service-level specific scheduling group) is first used in an RPC, a
new tenant is created.
In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
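A rough sketch of the tenant-selection idea; the names and data structures
below are placeholders, not the real messaging_service interface:
```
// Sketch only: map the currently active scheduling group (via the fast
// scheduling-group -> service-level-name lookup mentioned above) to a tenant,
// creating the tenant lazily the first time that service level sends an RPC.
#include <map>
#include <optional>
#include <string>

std::map<std::string, unsigned> tenant_ids;    // service level name -> tenant index

unsigned pick_tenant(const std::optional<std::string>& sl_name) {
    if (!sl_name) {
        return 0;                              // system/default tenant
    }
    auto [it, created] = tenant_ids.try_emplace(*sl_name, tenant_ids.size() + 1);
    return it->second;                         // a new tenant on first use
}
```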
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.
Each group is statically assigned a pool of memory on startup and
dynamically distributes this memory between its semaphores, relative to
the number of shares of the corresponding scheduling group.
The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as to
make memory allocation between service levels fairer, in proportion to
their number of shares.
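The proportional split can be pictured with a toy helper (this is not the
actual reader_concurrency_semaphore_group code):
```
// Toy illustration: divide the group's fixed memory pool between the
// per-service-level semaphores in proportion to their scheduling groups' shares.
#include <cstddef>
#include <map>
#include <string>

std::map<std::string, size_t>
split_semaphore_memory(size_t pool_bytes, const std::map<std::string, unsigned>& shares) {
    size_t total_shares = 0;
    for (const auto& [_, s] : shares) {
        total_shares += s;
    }
    std::map<std::string, size_t> result;
    if (total_shares == 0) {
        return result;
    }
    for (const auto& [sl_name, s] : shares) {
        // e.g. with a 100 MiB pool, 1000 shares out of 2000 gets 50 MiB.
        result[sl_name] = pool_bytes * s / total_shares;
    }
    return result;
}
```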
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.
The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during the node's lifetime.
When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".
When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.
When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".
For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
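A simplified sketch of that lifecycle; the Seastar scheduling-group calls are
real API, while the function shapes and prefix handling are illustrative:
```
#include <seastar/core/coroutine.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/sstring.hh>
#include <vector>

using namespace seastar;

// Sketch: pick a scheduling group for a newly created service level, reusing a
// group parked by a previously deleted one when possible, and keep its shares
// in sync with the service level.
future<scheduling_group>
acquire_group(std::vector<scheduling_group>& unused_groups, sstring sl_name, float shares) {
    scheduling_group sg;
    if (!unused_groups.empty()) {
        sg = unused_groups.back();
        unused_groups.pop_back();
        co_await rename_scheduling_group(sg, "sl:" + sl_name);
    } else {
        sg = co_await create_scheduling_group("sl:" + sl_name, shares);
    }
    sg.set_shares(shares);   // also done when ALTER SERVICE LEVEL changes the shares
    co_return sg;
}

// Sketch: on deletion the group is renamed and parked for reuse, not destroyed.
future<> release_group(std::vector<scheduling_group>& unused_groups,
                       scheduling_group sg, sstring sl_name) {
    co_await rename_scheduling_group(sg, "sl_deleted:" + sl_name);
    unused_groups.push_back(sg);
}
```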
Add service level shares related fields to service_level_options and
slo_effective_names structs, and adjust the existing methods of the
former (merge_with, init_effective_names) to account for them.
The service levels table is queried with a `SELECT * ...` query, by
using the `execute_internal` method which prepares and caches the query
in a special cache for internal queries, separate from the user query
cache.
During rolling upgrade from a version which does not support service
level shares to a version that does, the `shares` column is added. The
aforementioned internal query cache is _not_ invalidated on schema
change, so the cache might still contain the prepared query from the
time before the column was added, and that prepared query will fetch the
old set of columns, without the new `shares` column.
In order to solve this, explicitly specify the columns in the query
string, using the full set of column names known at the time the query
is executed.
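A minimal sketch of the approach; the column and table names below are
illustrative of the legacy service levels table, not copied from the patch:
```
#include <fmt/format.h>
#include <string>

// Sketch: build the query from the column names known right now, instead of '*'.
std::string service_levels_select(bool shares_supported) {
    std::string columns = "service_level, timeout, workload_type";
    if (shares_supported) {
        columns += ", shares";
    }
    return fmt::format("SELECT {} FROM system_distributed.service_levels", columns);
}
// The resulting string is what gets handed to execute_internal(), which prepares
// and caches it keyed by the full query text, so the pre-upgrade cached entry
// (without `shares`) is simply a different cache entry.
```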
Note that this is a problem only for the legacy, non-raft service
levels. Raft-based service levels use a local table for which the schema
is determined on startup.
Also note that this code only fetches values from the `shares` column
but does not otherwise make any use of them; that will happen in later
commits in this series.
Add the "shares" column to the
system_distributed_keyspace.service_levels table, which is used by
legacy code.
Because this table is in a distributed and not local keyspace, adding
the column to an existing cluster during rolling upgrade requires a bit
of care. A callback is added to the workload prioritization cluster
feature which runs when the feature becomes enabled and adds the column
for all nodes in the cluster.
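A rough sketch of the hook; the registration mechanism is abstracted behind
std::function because the exact feature-service callback API is not shown here,
and the column type is an assumption:
```
#include <functional>
#include <string_view>

// Sketch only: invoked once the workload prioritization cluster feature becomes
// enabled, i.e. once every node advertises support for it.
void on_workload_prioritization_enabled(
        const std::function<void(std::string_view)>& run_internal_query) {
    // Adding the column is effectively idempotent: if a different node already
    // added it, the statement fails and the failure can be ignored.
    run_internal_query("ALTER TABLE system_distributed.service_levels ADD shares int");
}
```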
Add a "shares" column which hold the number of shares allocated to
given service level.
It is not used by the code at all right now, subsequent commits will
make good use of it.
Information about the number of shares per service level will be stored
in an additional column in the service levels table, which is managed
through group0. We will need the feature to make sure that all nodes in
the cluster know about the new column before any node starts applying
group0 commands that would touch the new column.
This feature also serves a role for the legacy service levels
implementation that uses system_distributed for storage: after all nodes
are upgraded to support workload prioritization, one of the nodes will
perform a schema change operation and will add the new column.
Workload prioritization assigns scheduling groups to service levels, and
the number of scheduling groups that can exist at the same time is
limited by a compile-time parameter in Seastar. The documentation for
workload prioritization says that we currently support 7 user-managed
service levels and 1 created by default. Increase the current
compile-time limit in order to align with the documentation.
The `nonexistant_service_level_exception` can be thrown by service
levels code and propagated up to the CQL server layer, where it is
converted into a CQL protocol error. The aforementioned exception
inherits from `service_level_argument_exception`, which in turn inherits
from `std::invalid_argument` - which doesn't mean much to the CQL layer
and is converted to a generic SERVER_ERROR.
We can do better and return a more meaningful error code for this
exception. Change the base class of service_level_argument_exception to
exceptions::invalid_request_exception which gets converted to an INVALID
error.
The INVALID error code was already being used by the enterprise version,
so this commit just synchronizes error handling with enterprise.
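A minimal sketch of the resulting hierarchy (constructor details simplified;
the real definitions may differ):
```
#include "exceptions/exceptions.hh"

// Deriving from invalid_request_exception is what makes the transport layer
// report INVALID instead of the catch-all SERVER_ERROR.
class service_level_argument_exception : public exceptions::invalid_request_exception {
public:
    using exceptions::invalid_request_exception::invalid_request_exception;
};

class nonexistant_service_level_exception : public service_level_argument_exception {
public:
    using service_level_argument_exception::service_level_argument_exception;
};
```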
This adds to the grammar the option to SELECT a specific element in a collection (map/set/list).
For example:
`SELECT map['key'] FROM table`
`SELECT map['key1']['key2'] FROM table`
This feature was implemented in Cassandra 4.0 and was requested by Scylla users.
The behavior is mostly compatible with Cassandra, except:
1. in SELECT, we allow a list subscript in a selector, while Cassandra allows only map and set.
2. in UPDATE, we allow a set subscript in a column condition, while Cassandra allows only map and list.
3. the slice syntax `SELECT m[a..b]` is not implemented yet
4. null subscript - `SELECT m[null]` returns null in Scylla, while Cassandra returns an error
Fixes #7751
A backport was requested so that a user can make use of it.
Closes scylladb/scylladb#22051
* github.com:scylladb/scylladb:
cql3: allow SELECT of specific collection key
cql3: allow set subscript
Although the `network_topology_strategy::make_replication_map` ->
`tablet_aware_replication_strategy::do_make_replication_map` path
is not CPU intensive, it still allocates and constructs a shared
`tablet_effective_replication_map`, and that might stall with
thousands of tablet-based tables.
Therefore coroutinize the preparation loop to allow yielding.
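Schematically, the coroutinized loop looks like the sketch below; the types
are placeholders, and only seastar::coroutine::maybe_yield() is the real API
being relied on:
```
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <functional>
#include <memory>
#include <vector>

// `replication_map` and `make_map` are placeholders for the real
// tablet_effective_replication_map and its factory.
struct replication_map {};

seastar::future<std::vector<std::shared_ptr<replication_map>>>
prepare_all_maps(size_t n_tables,
                 std::function<std::shared_ptr<replication_map>(size_t)> make_map) {
    std::vector<std::shared_ptr<replication_map>> maps;
    maps.reserve(n_tables);
    for (size_t i = 0; i < n_tables; ++i) {
        maps.push_back(make_map(i));                  // cheap on CPU, but allocates
        co_await seastar::coroutine::maybe_yield();   // yield so thousands of tablet
                                                      // tables don't stall the reactor
    }
    co_return maps;
}
```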
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We already ignore gossiper entries with a host id equal to the local host id
in raft mode, since those entries are just outdated leftovers from before an
IP change. The same logic applies to gossiper mode as well, though, so do
the same in both modes.
Fixes: scylladb/scylladb#21930
Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:
After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.
A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.
All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.
This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.
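A toy illustration of the sampling idea (not the real utils/reservoir_sampling
or dict_sampler code):
```
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Keep a bounded, uniformly random sample of chunks from the merged RPC
// stream, to be handed to zstd dictionary training later.
class stream_reservoir {
    std::vector<std::string> _sample;
    size_t _capacity;
    size_t _seen = 0;
    std::mt19937_64 _rng{std::random_device{}()};
public:
    explicit stream_reservoir(size_t capacity) : _capacity(capacity) {}
    void ingest(std::string chunk) {
        ++_seen;
        if (_sample.size() < _capacity) {
            _sample.push_back(std::move(chunk));
            return;
        }
        // Classic "Algorithm R": every chunk seen so far ends up in the sample
        // with equal probability.
        std::uniform_int_distribution<size_t> dist(0, _seen - 1);
        if (size_t slot = dist(_rng); slot < _capacity) {
            _sample[slot] = std::move(chunk);
        }
    }
    // Copied periodically (every few minutes) and passed to the zstd trainer
    // running off the reactor threads.
    const std::vector<std::string>& sample() const { return _sample; }
};
```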
Closes scylladb/scylladb#22032
* github.com:scylladb/scylladb:
messaging_service: use advanced_rpc_compression::tracker for compression
message/dictionary_service: introduce dictionary_service
service: make Raft group 0 aware of system.dicts
db/system_keyspace: add system.dicts
utils: add advanced_rpc_compressor
utils: add dict_trainer
utils: introduce reservoir_sampling
utils: introduce alien_worker
utils: add stream_compressor
Logging randomization parameters in the pytest_generate_tests hook doesn't
work well for us. To make these parameters more visible, move the logging
to the test level.
Closes scylladb/scylladb#22055
This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git.
Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO)
to improve performance. LTO provides ~7% performance gain and enables crucial
binary layout optimizations for PGO.
LTO Changes:
- Add `-flto` flag to compile and link steps
- Use `-ffat-lto-objects` to generate both LLVM IR and machine code
- Enable cross-object optimization while maintaining fast test linking
PGO Implementation:
- Implement three-stage build process:
1. Context-free profiling (`-fprofile-generate`)
2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`)
3. Final optimization using merged profiles
- Add release-pgo and release-cs-pgo build stages
- Integrate with ninja build system
- Stages can be enabled independently
Profile Management:
- Add `pgo/pgo.py` for workload profile collection
- Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS
- Add configure.py integration for profile detection and validation
- Support custom profiles via `--use-profile` flag
- Add profile regeneration script
Both optimizations are recommended for maximum performance, though each PGO
stage adds a full build cycle. Future optimization may allow dropping one
PGO stage if performance impact is minimal.
---
This is a forward port, hence no need to backport.
Closes scylladb/scylladb#22039
* github.com:scylladb/scylladb:
build: cmake: add CMake options for PGO support
build: cmake: add "Scylla_ENABLE_LTO" option
build: set LTO and PGO flags for Seastar in cmake build
build: collect scylla libraries with `scylla_libs` variable
build: Unify Abseil CXX flags configuration
configure.py: prepare the build for a default PGO profile in version control
configure.py: introduce profile-guided optimization
pgo: add alternator workloads training
pgo: add a repair workload
pgo: add a counters workload
pgo: add a secondary index workload
pgo: add a LWT workload
pgo: add a decommission workload
pgo: add a clustering workload
pgo: add a basic workload
pgo: introduce a PGO training script
configure.py: don't include non-default modes in dist-server-* rules
configure.py: enable LTO in release builds by default
configure.py: introduce link-time optimization
configure.py: add a `default` to `add_tristate`.
configure.py: unify build rules for cxxbridge .cc files and regular .cc files
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`).
Also, yielding inside the storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, which caused scylladb/scylladb#17786
This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().
- storage_service: replicate_to_all_cores: update base and view tables atomically
Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)
To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
thus allowing yields only between base tables.
* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2
Closes scylladb/scylladb#21781
* github.com:scylladb/scylladb:
storage_service: replicate_to_all_cores: clear_gently pending erms
test_mv_topology_change: drop delay_after_erm_update injection case
storage_service: replicate_to_all_cores: update base and view tables atomically
table: make update_effective_replication_map sync again
This adds to the grammar the option to SELECT a specific key in a
collection column using subscript syntax.
For example:
SELECT map['key'] FROM table
SELECT map['key1']['key2'] FROM table
The key can also be parameterized in a prepared query. For this we need
to pass the query options to result_set_builder where we process the
selectors.
Fixes scylladb/scylladb#7751
test.py: only access combined_tests executable if it is built
Fixes #22038
Closes scylladb/scylladb#22069
* github.com:scylladb/scylladb:
test.py: only access combined_tests if it exists
test.py: rethrow CancelledError when executing a test
If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional and in practice happens on only one of a driver's connections, because there's no point listening for the same events on multiple connections), connections are wrongly displayed in the system.clients table as AUTHENTICATING instead of READY, even when they are ready.
This commit fixes this problem.
Fixes: scylladb/scylladb#12640
Closes scylladb/scylladb#21774
`record_property` generates XML which is not compatible with xunit2,
so pytest decided to deprecate it when generating xunit reports,
and pytest emits the following warning when a test failure is
reported using this fixture:
```
object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
```
This warning is not related to the test, but more about how we
report a failure using pytest. It is distracting, so let's silence it.
See also https://github.com/pytest-dev/pytest/issues/5202
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22067
There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which are caused by `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.
We need to deselect them for now to make CI more stable. The first batch was deselected in https://github.com/scylladb/scylladb/pull/21658
Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit.
See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957
Closes scylladb/scylladb#22044
* github.com:scylladb/scylladb:
test.py: topology_random_failures: more deselects for #21534
test.py: topology_random_failures: handle more node's hangs during 30s sleep
This allows using a subscript on a set column, in addition to map/list
columns, which were possible until now.
The behavior is compatible with Cassandra - a subscript with a specific value
returns the value if it's found in the set, and null otherwise.
When the scylla source tree is only partially built,
we still may want to run the tests.
test.py builds a case cache at boot, and executes
--list-cases for that, for all built tests.
After amalgamating boost unit tests into a single
file, it started running it unconditionally, which broke
partial builds.
Hence, only use combined_tests executable if it exists.
Fixes#22038
Commit 870f3b00fc,
"Add option to fail after number of failures" adds
tracking on the number of cancelled tests.
For the purpose, it intercepts CancelledError
and sets test's is_cancelled flag.
This introduced a regression reported in gh-21636:
Ctrl-C no longer works, since CancelledError is muted.
There was no intent to mute the exception;
re-throw it after accounting for the test as cancelled.
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.
`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.
`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.
This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
- "Scylla_BUILD_INSTRUMENTED" option
Scylla_BUILD_INSTRUMENTED allows us to instrument the code at
different levels, namely IR and CSIR. This option mirrors the
"--pgo" and "--cspgo" options in `configure.py`. Please note that
frontend instrumentation is not supported, as IR-based
instrumentation is better suited for performance optimization.
See https://lists.llvm.org/pipermail/llvm-dev/2015-August/089044.html
for the rationale.
- "Scylla_PROFDATA_FILE" option
This option allows us to specify profile data previously generated
with the "Scylla_BUILD_INSTRUMENTED" option. It mirrors
the `--use-profile` option in `configure.py`, but it does not
treat an empty value as a special case meaning a file
fetched from Git LFS; that will be handled by another option in a
follow-up change. Please note that one cannot use
-DScylla_BUILD_INSTRUMENTED=PGO and -DScylla_PROFDATA_FILE=...
at the same time; clang just does not allow this. CSPGO is fine, though.
- "Scylla_PROFDATA_COMPRESSED_FILE" option
This option allows us to specify compressed profile data previously
generated with the "Scylla_BUILD_INSTRUMENTED" option. Along with
"Scylla_PROFDATA_FILE", this option mirrors the functionality of
`--use-profile` in `configure.py`. The goal is to ensure the user always
gets the result implied by the specified options; if anything goes wrong,
we just error out.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add an option named "Scylla_ENABLE_LTO", which is off by default.
If it is on, the whole tree is built with ThinLTO enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This change extends scylla commit 7cb74df to scylla-enterprise commit
4ece7e1.
We recently started building Seastar as an external project, so
we need to prepare its compilation flags separately. In enterprise
scylla, we prepare the LTO- and PGO-related cflags in
`prepare_advanced_optimizations()`. This function is called when
preparing the build rules directly from `configure.py`, and although
we have equivalent settings in CMake, they cannot be applied to Seastar
due to the reason above.
In this change, we set up the LTO and PGO compilation flags when
generating the build system for Seastar when building using CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- Set ABSL_GCC_FLAGS and ABSL_LLVM_FLAGS with a more generic absl_cxx_flags
- Enable more flexible configuration of compiler flags for Abseil libraries
- Provide a centralized approach to setting compilation flags
Previously, sanitizer-specific flags were directly applied to Abseil library builds.
This change allows for more extensible compilation-flag management across
different build configurations.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds the following logic to the release build:
pgo/profiles/profile.profdata.xz is the default profile file, compressed.
This file is stored in version control using git LFS.
A ninja rule is added which creates build/profile.profdata by decompressing it.
If no profile file is explicitly specified, ./configure.py checks whether
the compressed default profile file exists and is compressed.
(If it exists, but isn't compressed, the user most likely has
git lfs disabled or not installed. In this case, the file visible in the working
tree will be the LFS placeholder text file describing the LFS metadata.)
If the compressed file exists, build/profile.profdata is chosen as the used
profile file.
If it doesn't exist, a warning is printed and configure.py falls back
to a profileless build.
The default profile file can be explicitly disabled by passing the empty
--use-profile="" to configure.py
A script is added which re-generates the profile.
After the script is run, the re-generated compressed profile can be staged,
committed, pushed and merged to update the default profile.
This commit enables profile-guided optimizations (PGO) in the Scylla build.
A full LLVM PGO requires 3 builds:
1. With -fprofile-generate to generate context-free (pre-inlining) profile. This
profile influences inlining, indirect-call promotion and call graph
simplifications.
2. With -fprofile-use=results_of_build_1 -fcs-profile-generate to generate
context-sensitive (post-inlining) profile. This profile influences post-inline
and codegen optimizations.
3. With -fprofile-use=merged_results_of_builds_1_2 to build the final binary
with both profiles.
We do all three in one ninja call by adding release-pgo and release-cs-pgo
"stages" to release. They are a copy of regular release mode, just with the
flags described above added. With the full course, release objects depend on the
profile file produced by build/release-cs-pgo/scylla, while release-cs-pgo
depends on the profile file generated by build/release-pgo/scylla.
The stages are orthogonal and enabled with separate options. It's recommended
to run them both for full performance, but unfortunately each one adds a full
build of scylla to the compile time, so maybe we can drop one of them in the
future if it turns out e.g. that regular PGO doesn't have a big effect.
It's strongly recommended to combine PGO with LTO. The latter enables the entire
class of binary layout optimizations, which for us is probably the most
important part of the entire thing.
This patch adds a set of alternator workloads to the PGO training
script.
To confirm that the added workloads indeed affect the profile, we can compare:
⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata
Instrumentation level: IR entry_first = 0
Total functions: 105075
Maximum function count: 1079870885
Maximum internal block count: 2197851358
and
⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata
Instrumentation level: IR entry_first = 0
Total functions: 105075
Maximum function count: 5240506052
Maximum internal block count: 9112894084
to see that function counters are at similar levels; they are around 5x higher for alternator,
but that's because it combines 5 specific sub-workloads.
To confirm that the final profile contains alternator functions, we can inspect:
⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR entry_first = 0
Functions shown: 356
Total functions: 105075
Number of functions with maximum count (< 100000): 97275
Number of functions with maximum count (>= 100000): 7800
Maximum function count: 7248370728
Maximum internal block count: 13722347326
We can see that 356 functions whose symbol name contains the word "alternator" were identified as 'hot' (with a max count greater than 100'000). Running:
⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR entry_first = 0
Functions shown: 806
Total functions: 105075
Number of functions with maximum count (< 1): 67036
Number of functions with maximum count (>= 1): 38039
Maximum function count: 7248370728
Maximum internal block count: 13722347326
We can see that 806 alternator functions were executed at least once during training.
And finally, to confirm that alternator-specific PGO brings any speedups, we run:
for workload in read scan write write_gsi write_rmw
do
./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null | grep median
done
results BEFORE:
median 258137.51910849303
median absolute deviation: 786.06
median 547.2578202937141
median absolute deviation: 6.33
median 145718.19856685458
median absolute deviation: 5689.79
median 89024.67095807113
median absolute deviation: 1302.56
median 43708.101729598646
median absolute deviation: 294.47
results AFTER:
median 303968.55333940056
median absolute deviation: 1152.19
median 622.4757636209254
median absolute deviation: 8.42
median 198566.0403745328
median absolute deviation: 1689.96
median 91696.44912842038
median absolute deviation: 1891.84
median 51445.356525664996
median absolute deviation: 1780.15
We can see that the single-node-cluster TPS increase is typically 13%-17%, with notable exceptions:
the improvement for write_gsi is 3%, and for the write workload a whopping 36%.
The increase is on top of CQL PGO.
The write workload is executed more often because it is also involved in data preparation for read and scan.
A further improvement could be to separate preparation from training, as is done for CQL, but it would
be a bit odd if ~3x higher counters for one flow had such a big impact.
Additional disclaimers:
- tests perform exactly the same workloads as in training, so there might be some bias
- tests run on a single-node cluster; a more realistic setup will likely show a lower improvement
Fixes https://github.com/scylladb/scylla-enterprise/issues/4066
This workload is added to teach PGO about repair.
Tests are inconclusive about its alignment with existing workloads,
because repair doesn't seem to utilize 100% of the reactor.
This workload is added to teach PGO about counters.
Tests seem to show it's mostly aligned with existing CQL workloads.
The config YAML is based on the default cassandra-stress schema.
This workload is added to teach PGO about secondary indexes.
Tests seem to show that it's mostly aligned with existing CQL workloads.
The config YAML was copied from one of the scylla-cluster-tests test cases.
This workload is added to teach PGO about LWT codepaths.
Tests seem to show that it's mostly aligned with existing CQL workloads.
The config YAML was copied from one of scylla-cluster-tests test cases.
This workload is added to teach PGO about streaming.
Tests show that this workload is mostly orthogonal to CQL workloads
(where "orthogonal" means that training on workload A doesn't improve workload
B much, while training on workload A doesn't improve workload B much),
so adding it to the training is quite important.
In contrast to the basic workload, this workload uses clustering
keys, CK range queries, RF=1, logged batches, and more CQL types.
Tests seem to show that this workload is mostly aligned with the existing basic
workload (where "aligned" means that training on workload A improves workload B
about as much as training on workload B itself does).
The config YAML is based on the example YAML attached to cassandra-stress
sources.
Profile-guided optimization consists of the following steps:
1. Build the program as usual, but with special options (instrumentation
or just some supplementary info tables, depending on the exact flavor of PGO
in use).
2. Collect an execution profile from the special binary by running a
training workload on it.
3. Rebuild the program again, using the collected profile.
This commit introduces a script automating step 2: running PGO training workloads
on Scylla. The contents of training workloads will be added in future commits.
The changes in configure.py responsible for steps 1. and 3. will also appear
in future commits.
As input, the script takes a path to the instrumented binary, a path to
the output file, and a directory with (optionally) prepopulated datasets for use
in training. The output profile file can then be passed to the compiler to
perform a PGO build.
The script currently supports two kinds of PGO instrumentation: LLVM instrumentation
(binary instrumented with -fprofile-generate and -fcs-profile-generate passed to
clang during compilation) and BOLT instrumentation (binary instrumented with
`llvm-bolt -instrument`, with logs from this operation saved to
$binary_path.boltlog).
The actual training workloads for generating the profile will be added in later
commits.