Commit Graph

45994 Commits

Author SHA1 Message Date
Gleb Natapov
acbc667d3e storage_service: set raft topology change mode before using it in join_cluster
ss::join_cluster calls raft_topology_change_enabled() before the mode is
initialized below in the same function. Fix it by changing the order.
2025-01-02 18:44:19 +02:00
Gleb Natapov
491b7232de locator: drop inet_address usage to figure out per dc/rack replication
This allows the replication map to be calculated correctly even without
knowing the IPs of the nodes.
2025-01-02 18:44:19 +02:00
Gleb Natapov
c4b26ba8dc test: drop test_old_ip_notification_repro.py
The test no longer tests anything since the address map is updated much
earlier now by the gossiper itself, not by the notifiers. The
functionality is tested by a unit test now.
2025-01-01 12:43:11 +02:00
Gleb Natapov
c4db90799a test: address_map: check generation handling during entry addition
Check that adding an entry with smaller generation does not overwrite
existing entry.
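
For illustration only, the rule being checked can be sketched as follows. This is not the Scylla address map: the class `AddressMap`, its methods, and the choice to let an equal generation overwrite are all assumptions made for this sketch.

```python
class AddressMap:
    """Hypothetical host_id -> (ip, generation) map, for illustration only."""

    def __init__(self):
        self._entries = {}

    def add_or_update(self, host_id, ip, generation):
        """Store the entry unless an entry with a larger generation exists.

        Returns True if the entry was stored, False if it was ignored.
        """
        current = self._entries.get(host_id)
        if current is not None and generation < current[1]:
            # A smaller generation is stale and must not overwrite.
            return False
        self._entries[host_id] = (ip, generation)
        return True

    def find(self, host_id):
        """Return the stored IP for host_id, or None if unknown."""
        entry = self._entries.get(host_id)
        return entry[0] if entry is not None else None
```

With this sketch, adding `("h1", "10.0.0.2", 5)` and then `("h1", "10.0.0.1", 3)` leaves `10.0.0.2` in place.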
2025-01-01 12:43:11 +02:00
Benny Halevy
85bd799308 storage_service: replicate_to_all_cores: prevent stalls when preparing per-table erms
Although the `network_topology_strategy::make_replication_map` ->
`tablet_aware_replication_strategy::do_make_replication_map`
is not cpu intensive it still allocates and constructs a shared
`tablet_effective_replication_map`, and that might stall with
thousands of tablet-based tables.

Therefore coroutinize the preparation loop to allow yielding.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-31 14:52:39 +01:00
Gleb Natapov
745b6d7d0d gossiper: ignore gossiper entries with local host id in gossiper mode as well
We already ignore gossiper entries with host id equal to local host id
in raft mode since those entries are just outdated entries from before
the ip change. The same logic applies to gossiper mode as well though, so do
the same in both modes.

Fixes: scylladb/scylladb#21930

Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>
2024-12-31 15:50:12 +02:00
Avi Kivity
76cf5148e1 Merge 'message: introduce advanced rpc compression' from Michał Chojnowski
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:

After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.

A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.

All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.
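
As a rough illustration of how a fixed-size random sample of an unbounded stream can be maintained, here is the classic reservoir sampling scheme (Algorithm R). It is a generic sketch, not the code in `utils`: the function name, the item granularity, and the `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Classic Algorithm R: keep the first k items, then replace a random
    slot with item number i (0-based) with probability k / (i + 1).
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

The sample never grows beyond `k` items, so memory use stays bounded no matter how much traffic passes through.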

This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.

Closes scylladb/scylladb#22032

* github.com:scylladb/scylladb:
  messaging_service: use advanced_rpc_compression::tracker for compression
  message/dictionary_service: introduce dictionary_service
  service: make Raft group 0 aware of system.dicts
  db/system_keyspace: add system.dicts
  utils: add advanced_rpc_compressor
  utils: add dict_trainer
  utils: introduce reservoir_sampling
  utils: introduce alien_worker
  utils: add stream_compressor
2024-12-31 15:02:57 +02:00
Evgeniy Naydanov
4260f3f55a test.py: topology_random_failures: log randomization parameters in test
Logging randomization parameters in the pytest_generate_tests hook doesn't
play well for us.  To make these parameters more visible move the logging
to the test level.

Closes scylladb/scylladb#22055
2024-12-31 14:23:47 +02:00
Avi Kivity
2b48c2e72a Merge 'build: add support for LTO and PGO to the building system' from Kefu Chai
This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git.

Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO)
to improve performance. LTO provides ~7% performance gain and enables crucial
binary layout optimizations for PGO.

LTO Changes:
- Add `-flto` flag to compile and link steps
- Use `-ffat-lto-objects` to generate both LLVM IR and machine code
- Enable cross-object optimization while maintaining fast test linking

PGO Implementation:
- Implement three-stage build process:
  1. Context-free profiling (`-fprofile-generate`)
  2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`)
  3. Final optimization using merged profiles
- Add release-pgo and release-cs-pgo build stages
- Integrate with ninja build system
- Stages can be enabled independently

Profile Management:
- Add `pgo/pgo.py` for workload profile collection
- Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS
- Add configure.py integration for profile detection and validation
- Support custom profiles via `--use-profile` flag
- Add profile regeneration script

Both optimizations are recommended for maximum performance, though each PGO
stage adds a full build cycle. Future optimization may allow dropping one
PGO stage if performance impact is minimal.

---

this is a forward port, hence no need to backport.

Closes scylladb/scylladb#22039

* github.com:scylladb/scylladb:
  build: cmake: add CMake options for PGO support
  build: cmake: add "Scylla_ENABLE_LTO" option
  build: set LTO and PGO flags for Seastar in cmake build
  build: collect scylla libraries with `scylla_libs` variable
  build: Unify Abseil CXX flags configuration
  configure.py: prepare the build for a default PGO profile in version control
  configure.py: introduce profile-guided optimization
  pgo: add alternator workloads training
  pgo: add a repair workload
  pgo: add a counters workload
  pgo: add a secondary index workload
  pgo: add a LWT workload
  pgo: add a decommission workload
  pgo: add a clustering workload
  pgo: add a basic workload
  pgo: introduce a PGO training script
  configure.py: don't include non-default modes in dist-server-* rules
  configure.py: enable LTO in release builds by default
  configure.py: introduce link-time optimization
  configure.py: add a `default` to `add_tristate`.
  configure.py: unify build rules for cxxbridge .cc files and regular .cc files
2024-12-31 14:14:40 +02:00
Avi Kivity
4905b1bf76 Merge 'table: make update_effective_replication_map sync again' from Benny Halevy
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`).

Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, which caused scylladb/scylladb#17786.

This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().

- storage_service: replicate_to_all_cores: update base and view tables atomically

Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)

To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.

* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2

Closes scylladb/scylladb#21781

* github.com:scylladb/scylladb:
  storage_service: replicate_to_all_cores: clear_gently pending erms
  test_mv_topology_change: drop delay_after_erm_update injection case
  storage_service: replicate_to_all_cores: update base and view tables atomically
  table: make update_effective_replication_map sync again
2024-12-30 23:42:06 +02:00
Tomasz Grabiec
bf3d0b3543 reader_concurrency_semaphore: Optimize resource_units destruction by postponing wait list processing
Observed 3% throughput improvement in sstable-heavy workload bounded by CPU.

SStable parsing involves lots of buffer operations which obtain and
destroy resource_units. Before the patch, resource_units destruction
invoked maybe_admit_waiters(), which performs some computations on
waiting permits. We don't really need to admit on each change of
resources, since the CPU is used by other things anyway. We can batch
the computation. There is already a fiber which does this for
processing the _ready_list. We can reuse it for processing _wait_list
as well.

The changes violate an assumption made by tests that releasing
resources immediately triggers an admission check. Therefore, some of
the BOOST_REQUIRE_EQUAL checks need to be replaced with REQUIRE_EVENTUALLY_EQUAL
as the admission check is now done in the fiber processing the _ready_list.

`perf-simple-query --tablets --smp 1 -m 1G` results obtained for
fixed 400MHz frequency:

Before:
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

112590.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41353 insns/op,   17992 cycles/op,        0 errors)
122620.68 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41310 insns/op,   17713 cycles/op,        0 errors)
118169.48 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41353 insns/op,   17857 cycles/op,        0 errors)
120634.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41328 insns/op,   17733 cycles/op,        0 errors)
117317.18 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41347 insns/op,   17822 cycles/op,        0 errors)

         throughput: mean=118266.52 standard-deviation=3797.81 median=118169.48 median-absolute-deviation=2368.13 maximum=122620.68 minimum=112590.60
instructions_per_op: mean=41337.86 standard-deviation=18.73 median=41346.89 median-absolute-deviation=14.64 maximum=41352.53 minimum=41309.83
  cpu_cycles_per_op: mean=17823.50 standard-deviation=111.75 median=17821.97 median-absolute-deviation=90.45 maximum=17992.04 minimum=17713.00
```

After:
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

123689.63 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40997 insns/op,   17384 cycles/op,        0 errors)
129643.24 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40997 insns/op,   17325 cycles/op,        0 errors)
128907.27 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41009 insns/op,   17325 cycles/op,        0 errors)
130342.56 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40993 insns/op,   17286 cycles/op,        0 errors)
130294.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40972 insns/op,   17336 cycles/op,        0 errors)

         throughput: mean=128575.36 standard-deviation=2792.75 median=129643.24 median-absolute-deviation=1718.73 maximum=130342.56 minimum=123689.63
instructions_per_op: mean=40993.51 standard-deviation=13.23 median=40996.73 median-absolute-deviation=3.30 maximum=41008.86 minimum=40972.48
  cpu_cycles_per_op: mean=17331.16 standard-deviation=35.02 median=17324.84 median-absolute-deviation=6.49 maximum=17383.97 minimum=17286.33
```

Closes scylladb/scylladb#21918

[avi: patch was co-authored by Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>]
2024-12-30 23:37:46 +02:00
Avi Kivity
b32b7ab806 Merge 'test.py: only access combined_tests executable if it is built' from Konstantin Osipov
test.py: only access combined_tests executable if it is built

Fixes #22038

Closes scylladb/scylladb#22069

* github.com:scylladb/scylladb:
  test.py: only access combined_tests if it exists
  test.py: rethrow CancelledError when executing a test
2024-12-30 15:15:39 +02:00
Piotr Smaron
2352063f20 server: set connection_stage to READY when authenticated
If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice happens on only one of a driver's connections, because there's no point listening for the same events on multiple connections), connections are wrongly displayed in system.clients as AUTHENTICATING instead of READY, even when they are ready.
This commit fixes this problem.

Fixes: scylladb/scylladb#12640

Closes scylladb/scylladb#21774
2024-12-30 14:04:26 +02:00
Kefu Chai
6281fb825f test/pytest.ini: ignore warning on deprecated record_property fixture
`record_property` generates XML which is not compatible with xunit2,
so pytest decided to deprecate it when generating xunit reports,
and pytest generates the following warning when a test failure is
reported using this fixture:

```
  object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
```

this warning is not related to the test, but more about how we
report a failure using pytest. it is distracting, so let's silence it.

See also https://github.com/pytest-dev/pytest/issues/5202

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22067
2024-12-30 10:58:31 +02:00
Nadav Har'El
27180620af Merge 'topology_random_failures: deselect more cases which can cause #21534' from Evgeniy Naydanov
There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which are caused by `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.

Need to deselect them for now to make CI more stable.  First batch deselected in https://github.com/scylladb/scylladb/pull/21658

Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit.

See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957

Closes scylladb/scylladb#22044

* github.com:scylladb/scylladb:
  test.py: topology_random_failures: more deselects for #21534
  test.py: topology_random_failures: handle more node's hangs during 30s sleep
2024-12-30 10:52:22 +02:00
Konstantin Osipov
8b7a5ca88d test.py: only access combined_tests if it exists
When the scylla source tree is only partially built,
we still may want to run the tests.

test.py builds a case cache at boot, and executes
--list-cases for that, for all built tests.

After amalgamating boost unit tests into a single
file, it started running it unconditionally, which broke
partial builds.

Hence, only use combined_tests executable if it exists.

Fixes #22038
2024-12-27 14:54:13 -05:00
Konstantin Osipov
2b1ba9c3fd test.py: rethrow CancelledError when executing a test
Commit 870f3b00fc,
"Add option to fail after number of failures" adds
tracking on the number of cancelled tests.

For that purpose, it intercepts CancelledError
and sets the test's is_cancelled flag.

This introduced a regression reported in gh-21636:
Ctrl-C no longer works, since CancelledError is muted.

There was no intent to mute the exception, so
re-throw it after accounting the test as cancelled.
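
The pattern can be sketched as below. This is a minimal illustration, not the test.py code: the `Test` class, `run_test`, and `demo` are hypothetical names, and the long sleep stands in for the real test body.

```python
import asyncio


class Test:
    """Hypothetical stand-in for a test.py test object."""

    def __init__(self):
        self.is_cancelled = False


async def run_test(test):
    try:
        await asyncio.sleep(3600)  # placeholder for the real test body
    except asyncio.CancelledError:
        test.is_cancelled = True  # account the test as cancelled
        raise  # re-raise so that cancellation (e.g. Ctrl-C) still propagates


async def demo():
    """Cancel a running test and report whether cancellation propagated."""
    test = Test()
    task = asyncio.create_task(run_test(test))
    await asyncio.sleep(0)  # give the task a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return test.is_cancelled and task.cancelled()
```

If the bare `raise` were dropped, the task would finish normally, `task.cancelled()` would be false, and the caller would never learn that it had been interrupted.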
2024-12-27 14:40:47 -05:00
Michał Chojnowski
fdb2d2209c messaging_service: use advanced_rpc_compression::tracker for compression
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.

`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.

`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.

This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
2024-12-27 10:17:58 +01:00
Kefu Chai
6adf70ec03 build: cmake: add CMake options for PGO support
- "Scylla_BUILD_INSTRUMENTED" option

  Scylla_BUILD_INSTRUMENTED allows us to instrument the code at
  different levels, namely IR and CSIR. this option mirrors the
  "--pgo" and "--cspgo" options in `configure.py`. please note,
  the instrumentation at the frontend is not supported, as the IR
  based instrumentation is better when it comes to the use case of
  optimization for performance.
  see https://lists.llvm.org/pipermail/llvm-dev/2015-August/089044.html
  for the rationale.

- "Scylla_PROFDATA_FILE" option

  this option allows us to specify the profile data previously generated
  with the "Scylla_BUILD_INSTRUMENTED" option. this option mirrors
  the `--use-profile` option in `configure.py`, but it does not
  take the empty option as a special case and consider it as a file
  fetched from Git LFS. that will be handled by another option in a
  follow-up change. please note, one cannot use
  -DScylla_BUILD_INSTRUMENTED=PGO and -DScylla_PROFDATA_FILE=...
  at the same time. clang just does not allow this. but CSPGO is fine.

- "Scylla_PROFDATA_COMPRESSED_FILE" option

  this option allows us to specify the compressed profile data previously
  generated with the "Scylla_BUILD_INSTRUMENTED" option. along with
  "Scylla_PROFDATA_FILE", this option mirrors the functionality of
  `--use-profile` in `configure.py`. the goal is to ensure user always
  gets the result with the specified options. if anything goes wrong,
  we just error out.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
4154789670 build: cmake: add "Scylla_ENABLE_LTO" option
add an option named "Scylla_ENABLE_LTO", which is off by default.
if it is on, build the whole tree with ThinLTO enabled.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
2647369d46 build: set LTO and PGO flags for Seastar in cmake build
This change extends scylla commit 7cb74df to scylla-enterprise-commit
4ece7e1.

we recently started building Seastar as an external project, so
we need to prepare its compilation flags separately. in enterprise
scylla, we prepare the LTO and PGO related cflags in
`prepare_advanced_optimizations()`. this function is called when
preparing the build rules directly from `configure.py`, and although
we have equivalent settings in CMake, they cannot be applied to Seastar
for the reason above.

in this change, we set up the LTO and PGO compilation flags when
generating the build system for Seastar when building using CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
ffe8c5dcdb build: collect scylla libraries with scylla_libs variable
with which, we can set the properties of these targets in a single place.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
610f1b7a0a build: Unify Abseil CXX flags configuration
- Set ABSL_GCC_FLAGS and ABSL_LLVM_FLAGS with a more generic absl_cxx_flags
- Enables more flexible configuration of compiler flags for Abseil libraries
- Provides a centralized approach to setting compilation flags

Previously, sanitizer-specific flags were directly applied to Abseil library builds.
This change allows for more extensible compile flag management across
different build configurations.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Michał Chojnowski
131b1d6f81 configure.py: prepare the build for a default PGO profile in version control
This patch adds the following logic to the release build:

pgo/profiles/profile.profdata.xz is the default profile file, compressed.
This file is stored in version control using git LFS.
A ninja rule is added which creates build/profile.profdata by decompressing it.

If no profile file is explicitly specified, ./configure.py checks whether
the compressed default profile file exists and is compressed.
(If it exists, but isn't compressed, the user most likely has
git lfs disabled or not installed. In this case, the file visible in the working
tree will be the LFS placeholder text file describing the LFS metadata.)

If the compressed file exists, build/profile.profdata is chosen as the used
profile file.
If it doesn't exist, a warning is printed and configure.py falls back
to a profileless build.
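
One way such a check could look is sketched below. It is not the configure.py code: the function name and return values are assumptions. It relies only on the fact that xz streams begin with the six magic bytes `FD 37 7A 58 5A 00`, while a Git LFS placeholder is a small text file.

```python
import os

# Magic bytes at the start of every .xz stream.
XZ_MAGIC = b"\xfd7zXZ\x00"


def default_profile_state(path):
    """Classify the default profile file (hypothetical helper).

    Returns "missing" if the file does not exist, "compressed" if it
    starts with the xz magic, and "placeholder" otherwise (most likely
    a Git LFS pointer text file).
    """
    if not os.path.exists(path):
        return "missing"
    with open(path, "rb") as f:
        head = f.read(len(XZ_MAGIC))
    return "compressed" if head == XZ_MAGIC else "placeholder"
```

A caller would pick `build/profile.profdata` only in the "compressed" case and print a warning before falling back to a profileless build otherwise.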

The default profile file can be explicitly disabled by passing the empty
--use-profile="" to configure.py

A script is added which re-generates the profile.
After the script is run, the re-generated compressed profile can be staged,
committed, pushed and merged to update the default profile.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
a868b44ad8 configure.py: introduce profile-guided optimization
This commit enables profile-guided optimizations (PGO) in the Scylla build.

A full LLVM PGO requires 3 builds:
1. With -fprofile-generate to generate context-free (pre-inlining) profile. This
profile influences inlining, indirect-call promotion and call graph
simplifications.
2. With -fprofile-use=results_of_build_1 -fcs-profile-generate to generate
context-sensitive (post-inlining) profile. This profile influences post-inline
and codegen optimizations.
3. With -fprofile-use=merged_results_of_builds_1_2 to build the final binary
with both profiles.
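
To make the three builds concrete, the sketch below maps each stage to the extra clang flags listed above. The stage names come from this commit; the helper `pgo_stage_flags` and the profile file names used in the example are hypothetical and do not reflect how configure.py is actually structured.

```python
def pgo_stage_flags(stage, profile=None):
    """Return the extra clang flags for one PGO stage (illustrative only).

    stage:   "release-pgo", "release-cs-pgo" or "release"
    profile: path to the profile consumed by this stage, if any
    """
    if stage == "release-pgo":
        # Build 1: context-free (pre-inlining) instrumentation.
        return ["-fprofile-generate"]
    if profile is None:
        raise ValueError(f"stage {stage!r} needs a profile from the previous stage")
    if stage == "release-cs-pgo":
        # Build 2: use profile 1, add context-sensitive instrumentation.
        return [f"-fprofile-use={profile}", "-fcs-profile-generate"]
    if stage == "release":
        # Build 3: final binary, built with the merged profiles.
        return [f"-fprofile-use={profile}"]
    raise ValueError(f"unknown stage {stage!r}")
```

For example, `pgo_stage_flags("release-cs-pgo", "stage1.profdata")` yields `["-fprofile-use=stage1.profdata", "-fcs-profile-generate"]`.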

We do all three in one ninja call by adding release-pgo and release-cs-pgo
"stages" to release. They are a copy of regular release mode, just with the
flags described above added. With the full course, release objects depend on the
profile file produced by build/release-cs-pgo/scylla, while release-cs-pgo
depends on the profile file generated by build/release-pgo/scylla.

The stages are orthogonal and enabled with separate options. It's recommended
to run them both for full performance, but unfortunately each one adds a full
build of scylla to the compile time, so maybe we can drop one of them in the
future if it turns out e.g. that regular PGO doesn't have a big effect.

It's strongly recommended to combine PGO with LTO. The latter enables the entire
class of binary layout optimizations, which for us is probably the most
important part of the entire thing.
2024-12-27 16:16:04 +08:00
Marcin Maliszkiewicz
80989556ac pgo: add alternator workloads training
This patch adds a set of alternator workloads to pgo training
script.

To confirm that added workloads are indeed affecting profile we can compare:

⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata

Instrumentation level: IR  entry_first = 0
Total functions: 105075
Maximum function count: 1079870885
Maximum internal block count: 2197851358

and

⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata

Instrumentation level: IR  entry_first = 0
Total functions: 105075
Maximum function count: 5240506052
Maximum internal block count: 9112894084

to see that function counters are on similar levels, they are around 5x higher for alternator
but that's because it combines 5 specific sub-workloads.

To confirm that the final profile contains alternator functions we can inspect:

⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR  entry_first = 0
Functions shown: 356
Total functions: 105075
Number of functions with maximum count (< 100000): 97275
Number of functions with maximum count (>= 100000): 7800
Maximum function count: 7248370728
Maximum internal block count: 13722347326

we can see that 356 functions whose symbol names contain the word alternator were identified as 'hot' (with max count greater than 100'000). Running:

⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR  entry_first = 0
Functions shown: 806
Total functions: 105075
Number of functions with maximum count (< 1): 67036
Number of functions with maximum count (>= 1): 38039
Maximum function count: 7248370728
Maximum internal block count: 13722347326

we can see that 806 alternator functions were executed at least once during training.

And finally to confirm that alternator specific PGO brings any speedups we run:

for workload in read scan write write_gsi write_rmw
do
./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null | grep median
done

results BEFORE:

median 258137.51910849303
median absolute deviation: 786.06
median 547.2578202937141
median absolute deviation: 6.33
median 145718.19856685458
median absolute deviation: 5689.79
median 89024.67095807113
median absolute deviation: 1302.56
median 43708.101729598646
median absolute deviation: 294.47

results AFTER:

median 303968.55333940056
median absolute deviation: 1152.19
median 622.4757636209254
median absolute deviation: 8.42
median 198566.0403745328
median absolute deviation: 1689.96
median 91696.44912842038
median absolute deviation: 1891.84
median 51445.356525664996
median absolute deviation: 1780.15

We can see that the single-node cluster tps increase is typically 13% - 17%, with notable exceptions:
the improvement for write_gsi is 3% and for the write workload a whopping 36%.
The increase is on top of CQL PGO.
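
The percentages can be recomputed from the medians listed above. The snippet assumes that the output lines follow the order of the shell loop (read, scan, write, write_gsi, write_rmw); the function name is made up for this sketch.

```python
# Median tps values copied from the BEFORE/AFTER listings above.
# Assumption: the output lines follow the order of the shell loop.
WORKLOADS = ["read", "scan", "write", "write_gsi", "write_rmw"]
BEFORE = [258137.51910849303, 547.2578202937141, 145718.19856685458,
          89024.67095807113, 43708.101729598646]
AFTER = [303968.55333940056, 622.4757636209254, 198566.0403745328,
         91696.44912842038, 51445.356525664996]


def relative_gains():
    """Return the tps gain of AFTER over BEFORE per workload, in percent."""
    return {
        name: (after / before - 1.0) * 100.0
        for name, before, after in zip(WORKLOADS, BEFORE, AFTER)
    }


if __name__ == "__main__":
    for name, gain in relative_gains().items():
        print(f"{name:10s} {gain:6.2f}%")
```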

Write workload is executed more often because it's involved also as data preparation for read and scan.
Some further improvement could be to separate preparation from training as it's done for CQL but it would
be a bit odd if ~3x higher counters for one flow had such a big impact.

Additional disclaimers:
 - tests are performing exactly the same workloads as in training so there might be some bias
 - tests are running single node cluster, more realistic setup will likely show lower improvement

Fixes https://github.com/scylladb/scylla-enterprise/issues/4066
2024-12-27 16:16:04 +08:00
Michał Chojnowski
95c8d88b96 pgo: add a repair workload
This workload is added to teach PGO about repair.
Tests are inconclusive about its alignment with existing workloads,
because repair doesn't seem to utilize 100% of the reactor.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
1c9ce0a9ee pgo: add a counters workload
This workload is added to teach PGO about counters.
Tests seem to show it's mostly aligned with existing CQL workloads.

The config YAML is based on the default cassandra-stress schema.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
47dc0399cb pgo: add a secondary index workload
This workload is added to teach PGO about secondary indexes.
Tests seem to show that it's mostly aligned with existing CQL workloads.

The config YAML was copied from one of scylla-cluster-tests test cases.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
e67f4a5c51 pgo: add a LWT workload
This workload is added to teach PGO about LWT codepaths.
Tests seem to show that it's mostly aligned with existing CQL workloads.

The config YAML was copied from one of scylla-cluster-tests test cases.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
e217c124a6 pgo: add a decommission workload
This workload is added to teach PGO about streaming.
Tests show that this workload is mostly orthogonal to CQL workloads
(where "orthogonal" means that training on workload A doesn't improve workload
B much, while training on workload B doesn't improve workload A much),
so adding it to the training is quite important.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
65abecaede pgo: add a clustering workload
In contrast to the basic workload, this workload uses clustering
keys, CK range queries, RF=1, logged batches, and more CQL types.
Tests seem to show that this workload is mostly aligned with the existing basic
workload (where "aligned" means that training on workload A improves workload B
about as much as training on workload B).

The config YAML is based on the example YAML attached to cassandra-stress
sources.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
c1297dbcd2 pgo: add a basic workload
This commit adds the default cassandra-stress workload to the PGO training
suite.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
f73b122de3 pgo: introduce a PGO training script
Profile-guided optimization consists of the following steps:
1. Build the program as usual, but with special options (instrumentation
or just some supplementary info tables, depending on the exact flavor of PGO
in use).
2. Collect an execution profile from the special binary by running a
training workload on it.
3. Rebuild the program again, using the collected profile.

This commit introduces a script automating step 2: running PGO training workloads
on Scylla. The contents of training workloads will be added in future commits.
The changes in configure.py responsible for steps 1. and 3. will also appear
in future commits.

As input, the script takes a path to the instrumented binary, a path to
the output file, and a directory with (optionally) prepopulated datasets for use
in training. The output profile file can be then passed to the compiler to
perform a PGO build.

The script currently supports two kinds of PGO instrumentation: LLVM instrumentation
(binary instrumented with -fprofile-generate and -fcs-profile-generate passed to
clang during compilation) and BOLT instrumentation (binary instrumented with
`llvm-bolt -instrument`, with logs from this operation saved to
$binary_path.boltlog)

The actual training workloads for generating the profile will be added in later
commits.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
6f01ceae3d configure.py: don't include non-default modes in dist-server-* rules
dist-server-tar only includes default modes. Let dist-server-deb
and dist-server-rpm behave consistently with it.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
dd1a847d61 configure.py: enable LTO in release builds by default
2024-12-27 16:16:04 +08:00
Michał Chojnowski
4b03b91fbd configure.py: introduce link-time optimization
This patch introduces link-time optimization (LTO) to the build.

The performance gains from LTO alone are modest (~7%), but it's a vital ingredient
of effective profile-guided optimization, which will be introduced later.

In general, use of LTO is quite simple and transparent to build systems.
It is sufficient to add the -flto flag to compile and link steps, and use a
LTO-aware linker.
At compile time, -ffat-lto-objects will cause the compiler to emit .o
files containing both LTO-ready LLVM IR for main executable optimization and machine
code for fast test linking. At link time, those pieces of IR will be
compiled together, allowing cross-object optimization of the main
executable and the fast linking of test executables.

Due to its high compile time cost, the optimization can be toggled with a
configure.py option. As of this patch, it's disabled by default.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
192cb6de4b configure.py: add a default to add_tristate.
It will be used in the next patch.
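
One way such a tristate option with a default could look, as a sketch built on argparse. The signature and the `lto` option name used here are assumptions for illustration and may differ from the real configure.py helper.

```python
import argparse


def add_tristate(parser, name, dest, help, default=None):
    """Add paired --enable-NAME / --disable-NAME options sharing one dest.

    Sketch only: the value is True, False, or `default` when neither
    option is given on the command line.
    """
    parser.add_argument(f"--enable-{name}", dest=dest, action="store_true",
                        default=default, help=f"Enable {help}")
    parser.add_argument(f"--disable-{name}", dest=dest, action="store_false",
                        default=default, help=f"Disable {help}")


parser = argparse.ArgumentParser()
# "lto" is a hypothetical option name used for the example.
add_tristate(parser, "lto", dest="lto", help="link-time optimization", default=False)
print(parser.parse_args([]).lto)
print(parser.parse_args(["--enable-lto"]).lto)
```

With `default=None` the caller can still tell "not specified" apart from an explicit enable or disable.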
2024-12-27 16:16:04 +08:00
Michał Chojnowski
1224200d7a configure.py: unify build rules for cxxbridge .cc files and regular .cc files
This is going to prevent some code duplication in following patches.
2024-12-27 16:16:04 +08:00
Benny Halevy
3e22998dc1 sstables: parse(summary): reserve positions vector
We know the number of positions in advance
so reserve the chunked_vector capacity for that.

Note: reservation replaces the existing reset of the
positions member.  This is safe since we parse the summary
only once as sstable::read_summary() returns early
if the summary component is already populated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21767
2024-12-26 13:33:29 +02:00
Yaron Kaikov
bc487c9456 .github: cherry-pick each commit instead of merge commit when available
Until today, when we had a PR with multiple commits, we cherry-picked only the merge commit, which created a PR with a single commit (the merge commit) containing all relevant changes.

This was causing an issue when there was a need to backport only some of the commits, as in https://github.com/scylladb/scylladb/pull/21990 (reported by @gleb-cloudius).

Changing the logic to cherry-pick each commit.

Closes scylladb/scylladb#22027
2024-12-26 13:10:18 +02:00
Kefu Chai
6acc5294a4 treewide: migrate from boost::copy_range to std::ranges::to
now that we are allowed to use C++23, we have the luxury of using
`std::ranges::to`.

in this change, we:

- replace `boost::copy_range` with `std::ranges::to`
- remove unused `#include` of boost headers

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21880
2024-12-26 11:46:26 +02:00
Kefu Chai
6c031ad92f test/topology: Percent-encode URL in pytest artifact links
When embedding HTML documents in pytest reports with links to test artifacts,
parameterized test names containing special characters like "[" and "]" can
cause URL encoding issues. These characters, when used verbatim in URLs, can
trigger HTTP 400 errors on web servers.

This commit resolves the issue by percent-encoding the URLs for artifact links,
ensuring compatibility with servers like Jenkins and preventing "HTTP ERROR 400
Illegal Path Character" errors.

Changes:
- Percent-encode test artifact URLs to handle special characters
- Improve link robustness for parameterized test names
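
The encoding described above can be sketched with the standard library's `urllib.parse.quote`. The helper name `artifact_link` and the example base URL are hypothetical and not taken from the test code.

```python
from urllib.parse import quote


def artifact_link(base_url: str, test_name: str) -> str:
    """Build an artifact URL with the test name percent-encoded.

    Hypothetical helper for illustration. quote() leaves "/" unescaped by
    default and escapes characters such as "[" and "]".
    """
    return f"{base_url}/{quote(test_name)}"


# A parameterized test name as pytest would render it.
print(artifact_link("https://ci.example/artifact", "test_restart[mode-1]"))
```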

Fixes scylladb/scylla-pkg#4599
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21963
2024-12-26 10:23:52 +02:00
Konstantin Osipov
d87e1eb7ef test: merge topology_experimental_raft into topology_custom
This enables tablets in topology_custom, so explicitly
disable them where tests don't support tablets.
As part of this rename, patch a few imports.
Importing dependencies from another test is a bad idea -
please use shared libraries instead.

Fixed #20193

Closes scylladb/scylladb#22014
2024-12-26 00:33:08 +02:00
Yaron Kaikov
0fc7e786dd .github/scripts/auto-backport.py: fix wrong username param
In 2e6755ecca I added a comment that is posted when a PR has conflicts, so the assignee gets a notification about it. There was a problem with the user mention param (a missing `.login`).

Fixing it

Closes scylladb/scylladb#22036
2024-12-25 20:41:34 +02:00
Avi Kivity
465449e4a1 test: combined_test: relicense
Was inadvertently released under the AGPL.
2024-12-25 13:53:54 +02:00
Avi Kivity
3ffe93b6ae Merge 'Enhance load-and-stream with "scope"' from Pavel Emelyanov
The main purpose of this change is to enhance the restore from object storage usage.

Currently, restore uses the load-and-stream facility. When triggered, the restoring task opens the provided list of sstables from the remote bucket and then feeds that list to the load_and_stream() method. The method, in turn, iterates over the list, reads mutations, and for each mutation decides where to send it by checking the replication map (it's pretty much the same for both vnodes and tablets, but for tablets that are "fully contained" by a range there's a plan to stream faster).

As described above, restore is governed by a single node, and this single node reads all sstables from the object store, which can be very slow. This PR allows speeding things up. For that, the load-and-stream code is equipped with a "scope" filter which limits where mutations can be streamed to. There are four options for that: all, dc, rack and node. The "all" scope is how things work currently; "dc" and "rack" filter out target nodes that don't belong to this node's dc or rack, respectively. The "node" scope only streams mutations to the local node.

With the "node" scope it's possible to make all nodes in the cluster load mutations that belong to them in parallel, without re-sending them to peers. The last patch in this PR is the test that shows how it can be possible.
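
The scope filter described above could be sketched roughly as follows. The `Node` type and `filter_targets` function are hypothetical and do not mirror the sstables_loader code; treating "rack" as "same dc and same rack" is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    host_id: str
    dc: str
    rack: str


def filter_targets(local: Node, replicas: list[Node], scope: str) -> list[Node]:
    """Keep only the replicas that the given scope allows streaming to."""
    if scope == "all":
        # Current behaviour: stream to every replica.
        return list(replicas)
    if scope == "dc":
        return [n for n in replicas if n.dc == local.dc]
    if scope == "rack":
        # Assumption: a rack is only meaningful within its own dc.
        return [n for n in replicas if n.dc == local.dc and n.rack == local.rack]
    if scope == "node":
        # Only mutations owned by the local node are loaded.
        return [n for n in replicas if n.host_id == local.host_id]
    raise ValueError(f"unknown scope: {scope}")


local = Node("n1", "dc1", "r1")
replicas = [local, Node("n2", "dc1", "r2"), Node("n3", "dc2", "r1")]
for scope in ("all", "dc", "rack", "node"):
    print(scope, [n.host_id for n in filter_targets(local, replicas, scope)])
```

Running every node with the "node" scope means each node keeps only its own mutations, which is what lets the restore proceed in parallel without re-sending data to peers.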

Closes scylladb/scylladb#21169

* github.com:scylladb/scylladb:
  test: Add scope-streaming test (for restore from backup)
  api: New "scope" API param to load-and-stream calls
  sstables_loader: Propagate scope from API down
  sstables_loader: Filter tablets based on scope
  streamer: Disable scoped streaming of primary replica only
  sstables_loader: Introduce streaming scope
  sstables_loader: Wrap get_endpoints()
2024-12-25 13:52:51 +02:00
Nadav Har'El
23213e8696 Merge 'Make get_built_indexes REST API endpoint be consistent with system."IndexInfo" table' from Pavel Emelyanov
It turned out that the aforementioned APIs use slightly different sources of information about view build progress/status, which sometimes results in different reporting of whether an index is built. It's good to make those two APIs consistent. Also add a test for the REST API endpoint (the system table test was addressed by #21677).

Closes scylladb/scylladb#21814

* github.com:scylladb/scylladb:
  test: Add tests for MVs and indexes reporting by API endpoint(s)
  api: Use built_views table in get_built_indexes API
2024-12-25 11:47:03 +02:00
Evgeniy Naydanov
5992e8b031 test.py: topology_random_failures: more deselects for #21534
More cases found which can cause the same 'local_is_initialized()' assertion
during the node's bootstrap.
2024-12-25 06:38:13 +00:00
Evgeniy Naydanov
f337ecbafa test.py: topology_random_failures: handle more node's hangs during 30s sleep
The node is hanging and the coordinator just rolls back the topology state.  It's different from
`stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`
because in these cases the coordinator is just not able to start the topology change at all and
the message in the coordinator's log is different.

Error injections handled:
  - `stop_after_updating_cdc_generation`
  - `stop_before_streaming`

And, actually, it can be any cluster event which lasts more than 30s.
2024-12-25 06:38:13 +00:00