Commit Graph

46008 Commits

Author SHA1 Message Date
Piotr Dulikowski
bbc655ff32 test/boost: update service_level_controller_test for workload prio
Adjust some of the existing tests in service_level_controller_test.cc
and add some more in order to test the workload prioritization features,
i.e. the service level shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ce4032dfc0 qos: include number of shares in DESCRIBE
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
0f62eb45d1 cql3/statements: update SL statements for workload prioritization
Introduce the "SHARES" keyword which can be used in conjunction with
existing CQL statements related to the service levels.

Adjust the CQL statements for service levels:

- CREATE/ALTER now allow to set shares (only if the cluster is fully
  upgraded)
- LIST EFFECTIVE SERVICE LEVEL now return the number of shares in a new
  column
- LIST SERVICE LEVEL(S) also return the number of shares, and has the
  additional column "percentage of all service level shares"
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
6d90a933cd transport/server: use scheduling group assigned to current user
Now, when the user logs in and the connection becomes authenticated, the
processing loop of the connection is switched to the scheduling group
that corresponds to the service level assigned to the logged in user.
The scheduling group is also updated when the service level assigned to
this user changes.

Starting from this commit, the scheduling groups managed by the service
level controller are actually being used by user workload.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
f1b9737e07 messaging_service: use separate set of connections per service levels
In order to make sure that the scheduling group carries over RPC, and
also to prevent priority inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level in order to serve user traffic.

The above is achieved by reusing the existing concept of "tenants" in
messaging service: when a new service level (or, more accurately,
service-level specific scheduling group) is first used in an RPC, a
new tenant is created.

In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
7383013f43 replica/database: add reader concurrency semaphore groups
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.

Each group is statically assigned to some pool of memory on startup and
dynamically distribute this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.

The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
4cfd26efaf qos: manage and assign scheduling groups to service levels
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.

The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during node's lifetime.

When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".

When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.

When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".

For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ff51551a94 qos: use the shares field in service level reads/writes
Now, the newly introduced `shares` field is used when service levels are
either read from or written into system tables.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
a6f681029f qos: add shares to service_level_options
Add service level shares related fields to service_level_options and
slo_effective_names structs, and adjust the existing methods of the
former (merge_with, init_effective_names) to account for them.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
2eb35f37d0 qos: explicitly specify columns when querying service level tables
The service levels table is queried with a `SELECT * ...` query, by
using the `execute_internal` method which prepares and caches the query
in an special cache for internal queries, separate from the user query
cache.

During rolling upgrade from a version which does not support service
level shares to the one that does, the `shares` column is added. The
aforementioned internal query cache is _not_ invalidated on schema
change, so the cache might still contain the prepared query from the
time before the column was added, and that prepared query will fetch the
old set of column without the new `shares` column.

In order to solve this, explicitly specify the columns in the query
string, using the full set of column names from the time when the query
is executed.

Note that this is a problem only for the legacy, non-raft service
levels. Raft-based service levels use a local table for which the schema
is determined on startup.

Also note that this code only fetches values from the `shares` column
but does not make any use of it otherwise. It will be handled by later
commits in this series.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ea25b29684 db/system_distributed_keyspace: add shares column and upgrade code
Add the "shares" column to the
system_distributed_keyspace.service_levels table, which is used by
legacy code.

Because this table is in a distributed and not local keyspace, adding
the column to an existing cluster during rolling upgrade requires a bit
of care. A callback is added to the workload prioritization cluster
feature which runs when the feature becomes enabled and adds the column
for all nodes in the cluster.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
346fc84c3e db/system_keyspace: adjust SL schema for workload prioritization
Add a "shares" column which hold the number of shares allocated to
given service level.

It is not used by the code at all right now, subsequent commits will
make good use of it.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ecbf8721de gms: introduce WORKLOAD_PRIORITIZATION cluster feature
Information about the number of shares per service level will be stored
in an additional column in the service levels table, which is managed
through group0. We will need the feature to make sure that all nodes in
the cluster know about the new column before any node starts applying
group0 commands the would touch the new column.

This feature also serves a role for the legacy service levels
implementation that uses system_distributed for storage: after all nodes
are upgraded to support workload prioritization, one of the nodes will
perform a schema change operation and will add the new column.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
75d2d0d949 build: increase the max number of scheduling groups
Workload prioritization assigns scheduling groups to service levels, and
the number of scheduling groups that can exist at the same time is
limited with a compile-time parameter in seastar. The documentation for
workload prioritization says that we currently support 7 user-managed
service levels and 1 created by default. Increase the current
compile-time limit in order to align with the documentation.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
48e7ffc300 qos: return correct error code when SL does not exist
The `nonexistant_service_level_exception` can be thrown by service
levels code and propagated up to the CQL server layer, where it is
converted into a CQL protocol error. The aforementioned exception
inherits from `service_level_argument_exception`, which in turn inherits
from `std::invalid_argument` - which doesn't mean much to the CQL layer
and is converted to a generic SERVER_ERROR.

We can do better and return a more meaningful error code for this
exception. Change the base class of service_level_argument_exception to
exceptions::invalid_request_exception which gets converted to an INVALID
error.

The INVALID error code was already being used by the enterprise version,
so this commit just synchronizes error handling with enterprise.
2025-01-02 07:13:34 +01:00
Avi Kivity
727f68e0f5 Merge 'cql3: allow SELECT of specific collection element' from Michael Litvak
This adds to the grammar the option to SELECT a specific element in a collection (map/set/list).

For example:
`SELECT map['key'] FROM table`
`SELECT map['key1']['key2'] FROM table`

This feature was implemented in Cassandra 4.0 and was requested by scylla users.

The behavior is mostly compatible with Cassandra, except:
1. in SELECT, we allow list subscript in a selector, while cassandra allows only map and set.
2. in UPDATE, we allow set subscript in a column condition, while cassandra allows only map and list.
3. the slice syntax `SELECT m[a..b]` is not implemented yet
4. null subscript - `SELECT m[null]` returns null in scylla, while cassandra returns error

Fixes #7751

backport was requested for a user to be able to use it

Closes scylladb/scylladb#22051

* github.com:scylladb/scylladb:
  cql3: allow SELECT of specific collection key
  cql3: allow set subscript
2025-01-01 14:48:40 +02:00
Benny Halevy
85bd799308 storage_service: replicate_to_all_cores: prevent stalls when preparing per-table erms
Although the `network_topology_stratergy::make_replication_map` ->
`tablet_aware_replication_strategy::do_make_replication_map`
is not cpu intensive it still allocates and constructs a shared
`tablet_effective_replication_map`, and that might stall with
thousands of tablet-based tables.

Therefore coroutinize the preparation loop to allow yielding.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-31 14:52:39 +01:00
Gleb Natapov
745b6d7d0d gossiper: ignore gossiper entries with local host id in gossiper mode as well
We already ignore a gossiper entries with host id equal to local host id
in raft mode since those entries are just outdated entries since before
ip change. The same logic applies to gossiper mode as well though, so do
the same in both modes.

Fixes: scylladb/scylladb#21930

Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>
2024-12-31 15:50:12 +02:00
Avi Kivity
76cf5148e1 Merge 'message: introduce advanced rpc compression' from Michał Chojnowski
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:

After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.

A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.

All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.

This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.

Closes scylladb/scylladb#22032

* github.com:scylladb/scylladb:
  messaging_service: use advanced_rpc_compression::tracker for compression
  message/dictionary_service: introduce dictionary_service
  service: make Raft group 0 aware of system.dicts
  db/system_keyspace: add system.dicts
  utils: add advanced_rpc_compressor
  utils: add dict_trainer
  utils: introduce reservoir_sampling
  utils: introduce alien_worker
  utils: add stream_compressor
2024-12-31 15:02:57 +02:00
Evgeniy Naydanov
4260f3f55a test.py: topology_random_failures: log randomization parameters in test
Logging randomization parameters in the pytest_generate_tests hook doesn't
play well for us.  To make these parameters more visible move the logging
to the test level.

Closes scylladb/scylladb#22055
2024-12-31 14:23:47 +02:00
Avi Kivity
2b48c2e72a Merge 'build: add support for LTO and PGO to the building system' from Kefu Chai
This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git.

Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO)
to improve performance. LTO provides ~7% performance gain and enables crucial
binary layout optimizations for PGO.

LTO Changes:
- Add `-flto` flag to compile and link steps
- Use `-ffat-lto-objects` to generate both LLVM IR and machine code
- Enable cross-object optimization while maintaining fast test linking

PGO Implementation:
- Implement three-stage build process:
  1. Context-free profiling (`-fprofile-generate`)
  2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`)
  3. Final optimization using merged profiles
- Add release-pgo and release-cs-pgo build stages
- Integrate with ninja build system
- Stages can be enabled independently

Profile Management:
- Add `pgo/pgo.py` for workload profile collection
- Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS
- Add configure.py integration for profile detection and validation
- Support custom profiles via `--use-profile` flag
- Add profile regeneration script

Both optimizations are recommended for maximum performance, though each PGO
stage adds a full build cycle. Future optimization may allow dropping one
PGO stage if performance impact is minimal.

---

this is a forward port, hence no need to backport.

Closes scylladb/scylladb#22039

* github.com:scylladb/scylladb:
  build: cmake: add CMake options for PGO support
  build: cmake: add "Scylla_ENABLE_LTO" option
  build: set LTO and PGO flags for Seastar in cmake build
  build: collect scylla libraries with `scylla_libs` variable
  build: Unify Abseil CXX flags configuration
  configure.py: prepare the build for a default PGO profile in version control
  configure.py: introduce profile-guided optimization
  pgo: add alternator workloads training
  pgo: add a repair workload
  pgo: add a counters workload
  pgo: add a secondary index workload
  pgo: add a LWT workload
  pgo: add a decommission workload
  pgo: add a clustering workload
  pgo: add a basic workload
  pgo: introduce a PGO training script
  configure.py: don't include non-default modes in dist-server-* rules
  configure.py: enable LTO in release builds by default
  configure.py: introduce link-time optimization
  configure.py: add a `default` to `add_tristate`.
  configure.py: unify build rules for cxxbridge .cc files and regular .cc files
2024-12-31 14:14:40 +02:00
Avi Kivity
4905b1bf76 Merge 'table: make update_effective_replication_map sync again' from Benny Halevy
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.

Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786

This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().

- storage_service: replicate_to_all_cores: update base and view tables atomically

Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)

To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.

* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2

Closes scylladb/scylladb#21781

* github.com:scylladb/scylladb:
  storage_service: replicate_to_all_cores: clear_gently pending erms
  test_mv_topology_change: drop delay_after_erm_update injection case
  storage_service: replicate_to_all_cores: update base and view tables atomically
  table: make update_effective_replication_map sync again
2024-12-30 23:42:06 +02:00
Tomasz Grabiec
bf3d0b3543 reader_concurrency_semaphore: Optimize resource_units destruction by postponing wait list processing
Observed 3% throughput improvement in sstable-heavy workload bounded by CPU.

SStable parsing involves lots of buffer operations which obtain and
destroy resource_units. Before the patch, reosurce_unit destruction
invoked maybe_admit_waiters(), which performs some computations on
waiting permits. We don't really need to admit on each change of
resources, since the CPU is used by other things anyway. We can batch
the computation. There is already a fiber which does this for
processing the _ready_list. We can reuse it for processing _wait_list
as well.

The changes violate an assumption made by tests that releasing
resources immediately triggers an admission check. Therefore, some of
the BOOST_REQUIRE_EQUAL needs to be replaced with REQUIRE_EVENTUALLY_EQUAL
as the admision check is now done in the fiber processing the _ready_list.

`perf-simple-query` --tablets --smp 1 -m 1G results obtained for
fixed 400MHz frequency:

Before:
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

112590.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41353 insns/op,   17992 cycles/op,        0 errors)
122620.68 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41310 insns/op,   17713 cycles/op,        0 errors)
118169.48 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41353 insns/op,   17857 cycles/op,        0 errors)
120634.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41328 insns/op,   17733 cycles/op,        0 errors)
117317.18 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41347 insns/op,   17822 cycles/op,        0 errors)

         throughput: mean=118266.52 standard-deviation=3797.81 median=118169.48 median-absolute-deviation=2368.13 maximum=122620.68 minimum=112590.60
instructions_per_op: mean=41337.86 standard-deviation=18.73 median=41346.89 median-absolute-deviation=14.64 maximum=41352.53 minimum=41309.83
  cpu_cycles_per_op: mean=17823.50 standard-deviation=111.75 median=17821.97 median-absolute-deviation=90.45 maximum=17992.04 minimum=17713.00
```

After
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

123689.63 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40997 insns/op,   17384 cycles/op,        0 errors)
129643.24 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40997 insns/op,   17325 cycles/op,        0 errors)
128907.27 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41009 insns/op,   17325 cycles/op,        0 errors)
130342.56 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40993 insns/op,   17286 cycles/op,        0 errors)
130294.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40972 insns/op,   17336 cycles/op,        0 errors)

         throughput: mean=128575.36 standard-deviation=2792.75 median=129643.24 median-absolute-deviation=1718.73 maximum=130342.56 minimum=123689.63
instructions_per_op: mean=40993.51 standard-deviation=13.23 median=40996.73 median-absolute-deviation=3.30 maximum=41008.86 minimum=40972.48
  cpu_cycles_per_op: mean=17331.16 standard-deviation=35.02 median=17324.84 median-absolute-deviation=6.49 maximum=17383.97 minimum=17286.33
```

Closes scylladb/scylladb#21918

[avi: patch was co-authored by Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>]
2024-12-30 23:37:46 +02:00
Michael Litvak
5ef7afb968 cql3: allow SELECT of specific collection key
This adds to the grammar the option to SELECT a specific key in a
collection column using subscript syntax.

For example:
SELECT map['key'] FROM table
SELECT map['key1']['key2'] FROM table

The key can also be parameterized in a prepared query. For this we need
to pass the query options to result_set_builder where we process the
selectors.

Fixes scylladb/scylladb#7751
2024-12-30 17:05:20 +02:00
Avi Kivity
b32b7ab806 Merge 'test.py: only access combined_tests executable if it is built' from Konstantin Osipov
test.py: only access combined_tests executable if it is built

Fixes #22038

Closes scylladb/scylladb#22069

* github.com:scylladb/scylladb:
  test.py: only access combined_tests if it exists
  test.py: rethrow CancelledError when executing a test
2024-12-30 15:15:39 +02:00
Piotr Smaron
2352063f20 server: set connection_stage to READY when authenticated
If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice only happens on only one of a driver's connections — because there's no point listening for the same events on multiple connections), connections are wrongly displayed in the system.clients as AUTHENTICATING instead of READY, even when they are ready.
This commit fixes this problem.

Fixes: scylladb/scylladb#12640

Closes scylladb/scylladb#21774
2024-12-30 14:04:26 +02:00
Kefu Chai
6281fb825f test/pytest.ini: ignore warning on deprecated record_property fixture
`record_property` generates XML which is not compatible with xunit2,
so pytest decided to deprecated when the generating xunit reports.
and pytest generates following warning when a test failure is
reported using this fixture:

```
  object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
```

this warning is not related to the test, but more about how we
report a failure using pytrest. it is distracting, so let's silence it.

See also https://github.com/pytest-dev/pytest/issues/5202

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22067
2024-12-30 10:58:31 +02:00
Nadav Har'El
27180620af Merge 'topology_random_failures: deselect more cases which can cause #21534' from Evgeniy Naydanov
There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which caused by `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.

Need to deselect them for now to make CI more stable.  First batch deselected in https://github.com/scylladb/scylladb/pull/21658

Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit.

See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957

Closes scylladb/scylladb#22044

* github.com:scylladb/scylladb:
  test.py: topology_random_failures: more deselects for #21534
  test.py: topology_random_failures: handle more node's hangs during 30s sleep
2024-12-30 10:52:22 +02:00
Michael Litvak
2701b5d50d cql3: allow set subscript
This allows to use subscript on a set column, in addition to map/list
which was possible until now.
The behavior is compatible with Cassandra - a subscript with a specific value
returns the value if it's found in the set, and null otherwise.
2024-12-30 09:50:31 +02:00
Konstantin Osipov
8b7a5ca88d test.py: only access combined_tests if it exists
When the scylla source tree is only partially built,
we still may want to run the tests.

test.py builds a case cache at boot, and executes
--list-cases for that, for all built tests.

After amalgamating boost unit tests into a single
file, it started running it unconditionally, which broke
partial builds.

Hence, only use combined_tests executable if it exists.

Fixes #22038
2024-12-27 14:54:13 -05:00
Konstantin Osipov
2b1ba9c3fd test.py: rethrow CancelledError when executing a test
Commit 870f3b00fc,
"Add option to fail after number of failures" adds
tracking on the number of cancelled tests.

For the purpose, it intercepts CancelledError
and sets test's is_cancelled flag.

This introduced a regression reported in gh-21636:
Ctrl-C no longer works, since CancelledError is muted.

There was no intent to mute the exception,
re-throw it after accounting the test as cancelled.
2024-12-27 14:40:47 -05:00
Michał Chojnowski
fdb2d2209c messaging_service: use advanced_rpc_compression::tracker for compression
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.

`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.

`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.

This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
2024-12-27 10:17:58 +01:00
Kefu Chai
6adf70ec03 build: cmake: add CMake options for PGO support
- "Scylla_BUILD_INSTRUMENTED" option

  Scylla_BUILD_INSTRUMENTED allows us to instrument the code at
  different level, namely, IR, and CSIR. this option mirrors
  "--pgo" and "--cspgo" options in `configure.py` . please note,
  the instrumentation at the frontend is not supported, as the IR
  based instrumentation is better when it comes to the use case of
  optimization for performance.
  see https://lists.llvm.org/pipermail/llvm-dev/2015-August/089044.html
  for the rationales.

- "Scylla_PROFDATA_FILE" option

  this option allows us to specify the profile data previous generated
  with the "Scylla_BUILD_INSTRUMENTED" option. this option mirrors
  the `--use-profile` option in `configure.py`, but it does not
  take the empty option as a special case and consider it as a file
  fetched from Git LFS. that will be handled by another option in a
  follow-up change. please note, one cannot use
  -DScylla_BUILD_INSTRUMENTED=PGO and -DScylla_PROFDATA_FILE=...
  at the same time. clang just does not allow this. but CSPGO is fine.

- "Scylla_PROFDATA_COMPRESSED_FILE" option

  this option allows us to specify the compressed profile data previouly
  generated with the "Scylla_BUILD_INSTRUMENTED" option. along with
  "Scylla_PROFDATA_FILE", this option mirros the functionality of
  `--use-profile` in `configure.py`. the goal is to ensure user always
  gets the result with the specified options. if anything goes wrong,
  we just error out.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
4154789670 build: cmake: add "Scylla_ENABLE_LTO" option
add an option named "Scylla_ENABLE_LTO", which is off by default.
if it is on, build the whole tree with ThinLTO enabled.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
2647369d46 build: set LTO and PGO flags for Seastar in cmake build
This change extends scylla commit 7cb74df to scylla-enterprise-commit
4ece7e1.

we recently started building Seastar as an external project, so
we need to prepare its compilation flags separately. in enterprise
scylla, we prepare the LTO and PGO related cflags in
`prepare_advanced_optimizations()`. this function is called when
preparing the build rules directly from `configure.py`, and despite
we have equivalant settings in CMake, they cannot be applied to Seastar
due to the reason above.

in this change, we set up the the LTO and PGO compilation flags when
generating the buiding system for Seastar when building using CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
ffe8c5dcdb build: collect scylla libraries with scylla_libs variable
with which, we can set the properties of these targets in a single place.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
610f1b7a0a build: Unify Abseil CXX flags configuration
- Set ABSL_GCC_FLAGS and ABSL_LLVM_FLAGS with a more generic absl_cxx_flags
- Enables more flexible configuration of compiler flags for Abseil libraries
- Provides a centralized approach to setting compilation flags

Previously, sanitizer-specific flags were directly applied to Abseil library builds.
This change allows for more extensible compiling flag management across
different build configurations.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Michał Chojnowski
131b1d6f81 configure.py: prepare the build for a default PGO profile in version control
This patch adds the following logic to the release build:

pgo/profiles/profile.profdata.xz is the default profile file, compressed.
This file is stored in version control using git LFS.
A ninja rule is added which creates build/profile.profdata by decompressing it.

If no profile file is explicitly specified, ./configure.py checks whether
the compressed default profile file exists and is compressed.
(If it exists, but isn't compressed, the user most likely has
git lfs disabled or not installed. In this case, the file visible in the working
tree will be the LFS placeholder text file describing the LFS metadata.)

If the compressed file exists, build/profile.profdata is chosen as the used
profile file.
If it doesn't exist, a warning is printed and configure.py falls back
to a profileless build.

The default profile file can be explicitly disabled by passing the empty
--use-profile="" to configure.py

A script is added which re-generates the profile.
After the script is run, the re-generated compressed profile can be staged,
committed, pushed and merged to update the default profile.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
a868b44ad8 configure.py: introduce profile-guided optimization
This commit enables profile-guided optimizations (PGO) in the Scylla build.

A full LLVM PGO requires 3 builds:
1. With -fprofile-generate to generate context-free (pre-inlining) profile. This
profile influences inlining, indirect-call promotion and call graph
simplifications.
2. With -fprofile-use=results_of_build_1 -fcs-profile-generate to generate
context-sensitive (post-inlining) profile. This profile influences post-inline
and codegen optimizations.
3. With -fprofile-use=merged_results_of_builds_1_2 to build the final binary
with both profiles.

We do all three in one ninja call by adding release-pgo and release-cs-pgo
"stages" to release. They are a copy of regular release mode, just with the
flags described above added. With the full course, release objects depend on the
profile file produced by build/release-cs-pgo/scylla, while release-cs-pgo
depends on the profile file generated by build/release-pgo/scylla.

The stages are orthogonal and enabled with separate options. It's recommended
to run them both for full performance, but unfortunately each one adds a full
build of scylla to the compile time, so maybe we can drop one of them in the
future if it turns out e.g. that regular PGO doesn't have a big effect.

It's strongly recommended to combine PGO with LTO. The latter enables the entire
class of binary layout optimizations, which for us is probably the most
important part of the entire thing.
2024-12-27 16:16:04 +08:00
Marcin Maliszkiewicz
80989556ac pgo: add alternator workloads training
This patch adds a set of alternator workloads to pgo training
script.

To confirm that added workloads are indeed affecting profile we can compare:

⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata

Instrumentation level: IR  entry_first = 0
Total functions: 105075
Maximum function count: 1079870885
Maximum internal block count: 2197851358

and

⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata

Instrumentation level: IR  entry_first = 0
Total functions: 105075
Maximum function count: 5240506052
Maximum internal block count: 9112894084

to see that function counters are on similar levels, they are around 5x higher for alternator
but that's because it combines 5 specific sub-workloads.

To confirm that final profile contains alterantor functions we can inspect:

⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR  entry_first = 0
Functions shown: 356
Total functions: 105075
Number of functions with maximum count (< 100000): 97275
Number of functions with maximum count (>= 100000): 7800
Maximum function count: 7248370728
Maximum internal block count: 13722347326

we can see that 356 functions which symbol name contains word alternator were identified as 'hot' (with max count grater than 100'000). Running:

⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR  entry_first = 0
Functions shown: 806
Total functions: 105075
Number of functions with maximum count (< 1): 67036
Number of functions with maximum count (>= 1): 38039
Maximum function count: 7248370728
Maximum internal block count: 13722347326

we can see that 806 alternator functions were executed at least once during training.

And finally to confirm that alternator specific PGO brings any speedups we run:

for workload in read scan write write_gsi write_rmw
do
./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null | grep median
done

results BEFORE:

median 258137.51910849303
median absolute deviation: 786.06
median 547.2578202937141
median absolute deviation: 6.33
median 145718.19856685458
median absolute deviation: 5689.79
median 89024.67095807113
median absolute deviation: 1302.56
median 43708.101729598646
median absolute deviation: 294.47

results AFTER:

median 303968.55333940056
median absolute deviation: 1152.19
median 622.4757636209254
median absolute deviation: 8.42
median 198566.0403745328
median absolute deviation: 1689.96
median 91696.44912842038
median absolute deviation: 1891.84
median 51445.356525664996
median absolute deviation: 1780.15

We can see that single node cluster tps increase is typically 13% - 17% with notable exceptions,
improvement for write_gsi is 3% and for write workload whopping 36%.
The increase is on top of CQL PGO.

Write workload is executed more often because it's involved also as data preparation for read and scan.
Some further improvement could be to separate preparation from training as it's done for CQL but it would
be a bit odd if ~3x higher counters for one flow have so big impact.

Additional disclaimers:
 - tests are performing exactly the same workloads as in training so there might be some bias
 - tests are running single node cluster, more realistic setup will likely show lower improvement

Fixes https://github.com/scylladb/scylla-enterprise/issues/4066
2024-12-27 16:16:04 +08:00
Michał Chojnowski
95c8d88b96 pgo: add a repair workload
This workload is added to teach PGO about repair.
Tests are inconclusive about its alignment with existing workloads,
because repair doesn't seem utilize 100% of the reactor.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
1c9ce0a9ee pgo: add a counters workload
This workload is added to teach PGO about counters.
Tests seem to show it's mostly aligned with existing CQL workloads.

The config YAML is based on the default cassandra-stress schema.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
47dc0399cb pgo: add a secondary index workload
This workload is added to teach PGO about secondary indexes.
Tests seem to show that it's mostly aligned with existing CQL workloads.

The config YAML was copied from one of scylla-cluster-test test cases.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
e67f4a5c51 pgo: add a LWT workload
This workload is added to teach PGO about LWT codepaths.
Tests seem to show that it's mostly aligned with existing CQL workloads.

The config YAML was copied from one of scylla-cluster-tests test cases.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
e217c124a6 pgo: add a decommission workload
This workload is added to teach PGO about streaming.
Tests show that this workload is mostly orthogonal to CQL workloads
(where "orthogonal" means that training on workload A doesn't improve workload
B much, while training on workload A doesn't improve workload B much),
so adding it to the training is quite important.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
65abecaede pgo: add a clustering workload
In contrast to the basic workload, this workload uses clustering
keys, CK range queries, RF=1, logged batches, and more CQL types.
Tests seem to show that this workload is mostly aligned with the existing basic
workload (where "aligned" means that training on workload A improves workload B
about as much as training on workload B).

The config YAML is based on the example YAML attached to cassandra-stress
sources.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
c1297dbcd2 pgo: add a basic workload
This commit adds the default cassandra-stress workload to the PGO training
suite.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
f73b122de3 pgo: introduce a PGO training script
Profile-guided optimization consists of the following steps:
1. Build the program as usual, but with with special options (instrumentation
or just some supplementary info tables, depending on the exact flavor of PGO
in use).
2. Collect an execution profile from the special binary by running a
training workload on it.
3. Rebuild the program again, using the collected profile.

This commit introduces a script automating step 2: running PGO training workloads
on Scylla. The contents of training workloads will be added in future commits.
The changes in configure.py responsible for steps 1. and 3. will also appear
in future commits.

As input, the script takes a path to the instrumented binary, a path to a
the output file, and a directory with (optionally) prepopulated datasets for use
in training. The output profile file can be then passed to the compiler to
perform a PGO build.

The script current supports two kinds of PGO instrumentation: LLVM instrumentation
(binary instrumented with -fprofile-generate and -fcs-profile-generate passed to
clang during compilation) and BOLT instrumentation (binary instrumented with
`llvm-bolt -instrument`, with logs from this operation saved to
$binary_path.boltlog)

The actual training workloads for generating the profile will be added in later
commits.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
6f01ceae3d configure.py: don't include non-default modes in dist-server-* rules
dist-server-tar only includes default modes. Let dist-server-deb
and dist-server-rpm behave consistently with it.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
dd1a847d61 configure.py: enable LTO in release builds by default 2024-12-27 16:16:04 +08:00