scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 01:50:35 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	bbc655ff32	test/boost: update service_level_controller_test for workload prio Adjust some of the existing tests in service_level_controller_test.cc and add some more in order to test the workload prioritization features, i.e. the service level shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ce4032dfc0	qos: include number of shares in DESCRIBE Now, the CREATE statements generated for each service level by the DESCRIBE SCHEMA WITH INTERNALS statement will account for the service level's shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	0f62eb45d1	cql3/statements: update SL statements for workload prioritization Introduce the "SHARES" keyword which can be used in conjunction with existing CQL statements related to the service levels. Adjust the CQL statements for service levels: - CREATE/ALTER now allow setting shares (only if the cluster is fully upgraded) - LIST EFFECTIVE SERVICE LEVEL now returns the number of shares in a new column - LIST SERVICE LEVEL(S) also returns the number of shares, and has the additional column "percentage of all service level shares"	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	f1b9737e07	messaging_service: use separate set of connections per service levels In order to make sure that the scheduling group carries over RPC, and also to prevent priority inversion issues between different service levels, modify the messaging service to use separate RPC connections for each service level in order to serve user traffic. The above is achieved by reusing the existing concept of "tenants" in messaging service: when a new service level (or, more accurately, service-level specific scheduling group) is first used in an RPC, a new tenant is created. In addition, extend the service level controller to be able to quickly look up the service level name of the currently active scheduling group in order to speed up the logic for choosing the tenant.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	7383013f43	replica/database: add reader concurrency semaphore groups Replace the reader concurrency semaphores for user reads and view updates with the newly introduced reader concurrency semaphore group, which assigns a semaphore for each service level. Each group is statically assigned a pool of memory on startup and dynamically distributes this memory between the semaphores, relative to the number of shares of the corresponding scheduling group. The intent of having a separate reader concurrency semaphore for each scheduling group is to prevent priority inversion issues due to reads with different priorities waiting on the same semaphore, as well as to make memory allocation fairer between service levels due to the adjusted number of shares.	2025-01-02 07:13:34 +01:00
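The share-proportional split described in 7383013f43 can be illustrated with a small sketch. This is not the Scylla implementation: the function name `distribute_memory`, the service level names, and the rounding policy (integer division, remainder left unassigned) are all assumptions made for illustration.

```python
def distribute_memory(total_bytes: int, shares: dict[str, int]) -> dict[str, int]:
    """Split a fixed memory pool between service levels, proportionally to shares.

    Hypothetical helper: the real semaphore group may round or rebalance differently.
    """
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("at least one service level must have positive shares")
    # Integer division never hands out more than the pool; any remainder stays unassigned.
    return {name: total_bytes * s // total_shares for name, s in shares.items()}


pool = 1_000_000
limits = distribute_memory(pool, {"sl:oltp": 800, "sl:analytics": 200})
```

With these inputs the service level holding 80% of the shares receives 80% of the pool, so changing a service level's shares changes the memory limits of every semaphore in the group.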
Piotr Dulikowski	4cfd26efaf	qos: manage and assign scheduling groups to service levels Introduce the core logic of workload prioritization, responsible for assigning scheduling groups to service levels. The service level controller maintains a pool of scheduling groups for the currently present service levels, as well as a pool of unused scheduling groups which were previously used by some service level that was deleted during node's lifetime. When a new service level is created, the SL controller either assigns a scheduling group from the unused SG pool, or creates a new one if the pool is empty. The scheduling group is renamed to "sl:<scheduling group name>". When updating shares of a service level (and also when creating a new service level), the shares of the corresponding scheduling group are synchronized with those of the service level. When a service level is deleted, its group is released to the aforementioned pool of unused scheduling groups and the prefix of its name is changed from "sl:" to "sl_deleted:". For now, these scheduling groups are not used by any user operations. This will be changed in subsequent commits.	2025-01-02 07:13:34 +01:00
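A minimal sketch of the pooling logic described in 4cfd26efaf, assuming a scheduling group can be modelled as a plain dictionary. The class and method names are hypothetical, and using the service level name after the "sl:" prefix is an assumption about the naming scheme.

```python
class SchedulingGroupPool:
    """Toy model: active groups per service level plus a pool of released groups."""

    def __init__(self):
        self._next_id = 0
        self.active = {}   # service level name -> scheduling group (a dict here)
        self.unused = []   # groups released by deleted service levels

    def create_service_level(self, name, shares):
        if name in self.active:
            raise ValueError(f"service level {name!r} already exists")
        if self.unused:
            group = self.unused.pop()          # reuse a previously released group
        else:
            group = {"id": self._next_id, "name": "", "shares": 0}
            self._next_id += 1                 # create a new group only when the pool is empty
        group["name"] = f"sl:{name}"
        group["shares"] = shares               # keep group shares in sync with the service level
        self.active[name] = group
        return group

    def set_shares(self, name, shares):
        self.active[name]["shares"] = shares

    def drop_service_level(self, name):
        group = self.active.pop(name)
        group["name"] = "sl_deleted:" + group["name"].removeprefix("sl:")
        self.unused.append(group)              # keep the group for later reuse
```

Reusing released groups instead of destroying them keeps the total number of scheduling groups bounded by the peak number of service levels that existed at the same time.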
Piotr Dulikowski	ff51551a94	qos: use the shares field in service level reads/writes Now, the newly introduced `shares` field is used when service levels are either read from or written into system tables.	2025-01-02 07:13:34 +01:00
Avi Kivity	727f68e0f5	Merge 'cql3: allow SELECT of specific collection element' from Michael Litvak This adds to the grammar the option to SELECT a specific element in a collection (map/set/list). For example: `SELECT map['key'] FROM table` `SELECT map['key1']['key2'] FROM table` This feature was implemented in Cassandra 4.0 and was requested by scylla users. The behavior is mostly compatible with Cassandra, except: 1. in SELECT, we allow list subscript in a selector, while cassandra allows only map and set. 2. in UPDATE, we allow set subscript in a column condition, while cassandra allows only map and list. 3. the slice syntax `SELECT m[a..b]` is not implemented yet 4. null subscript - `SELECT m[null]` returns null in scylla, while cassandra returns error Fixes #7751 backport was requested for a user to be able to use it Closes scylladb/scylladb#22051 * github.com:scylladb/scylladb: cql3: allow SELECT of specific collection key cql3: allow set subscript	2025-01-01 14:48:40 +02:00
Avi Kivity	76cf5148e1	Merge 'message: introduce advanced rpc compression' from Michał Chojnowski This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows: After the patch, messaging_service (Scylla's interface for all inter-node communication) compresses its network traffic with compressors managed by the new advanced_rpc_compression::tracker. Those compressors compress with lz4, but can also be configured to use zstd as long as a CPU usage limit isn't crossed. A precomputed compression dictionary can be fed to the tracker. Each connection handled by the tracker will then start a negotiation with the other end to switch to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary. All traffic going through the tracker is passed as a single merged "stream" through dict_sampler. dictionary_service has access to the dict_sampler. On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain a random multi-megabyte sample of the sampler's stream. Every several minutes, it copies the sample, trains a compression dictionary on it (by calling zstd's training library via the alien_worker thread) and publishes the new dictionary to system.dicts via Raft's write_mutation command. This update triggers (eventually) a callback on all nodes, which feeds the new dictionary to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections to this dictionary. 
Closes scylladb/scylladb#22032 * github.com:scylladb/scylladb: messaging_service: use advanced_rpc_compression::tracker for compression message/dictionary_service: introduce dictionary_service service: make Raft group 0 aware of system.dicts db/system_keyspace: add system.dicts utils: add advanced_rpc_compressor utils: add dict_trainer utils: introduce reservoir_sampling utils: introduce alien_worker utils: add stream_compressor	2024-12-31 15:02:57 +02:00
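The benefit of a shared, precomputed dictionary for small messages (76cf5148e1) can be demonstrated with Python's standard library. Note the substitution: this sketch uses zlib's preset-dictionary support as a stand-in for zstd dictionaries, and the message contents and helper names are invented for illustration.

```python
import zlib


def compress_with_dict(payload: bytes, dictionary: bytes) -> bytes:
    # Both sides must hold the same dictionary; it is not sent with the message.
    c = zlib.compressobj(zdict=dictionary)
    return c.compress(payload) + c.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()


# Hypothetical "trained" dictionary: content typical of earlier messages.
dictionary = b'{"verb": "read", "keyspace": "ks", "table": "t", "token": '
message = b'{"verb": "read", "keyspace": "ks", "table": "t", "token": 12345}'

plain = zlib.compress(message)
with_dict = compress_with_dict(message, dictionary)
restored = decompress_with_dict(with_dict, dictionary)
```

Because a short message contains little internal repetition, almost all of the saving comes from matches against the dictionary, which is why the receiver must already possess the same dictionary before the sender starts using it.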
Evgeniy Naydanov	4260f3f55a	test.py: topology_random_failures: log randomization parameters in test Logging randomization parameters in the pytest_generate_tests hook doesn't play well for us. To make these parameters more visible move the logging to the test level. Closes scylladb/scylladb#22055	2024-12-31 14:23:47 +02:00
Avi Kivity	2b48c2e72a	Merge 'build: add support for LTO and PGO to the building system' from Kefu Chai This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git. Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO) to improve performance. LTO provides ~7% performance gain and enables crucial binary layout optimizations for PGO. LTO Changes: - Add `-flto` flag to compile and link steps - Use `-ffat-lto-objects` to generate both LLVM IR and machine code - Enable cross-object optimization while maintaining fast test linking PGO Implementation: - Implement three-stage build process: 1. Context-free profiling (`-fprofile-generate`) 2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`) 3. Final optimization using merged profiles - Add release-pgo and release-cs-pgo build stages - Integrate with ninja build system - Stages can be enabled independently Profile Management: - Add `pgo/pgo.py` for workload profile collection - Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS - Add configure.py integration for profile detection and validation - Support custom profiles via `--use-profile` flag - Add profile regeneration script Both optimizations are recommended for maximum performance, though each PGO stage adds a full build cycle. Future optimization may allow dropping one PGO stage if performance impact is minimal. --- this is a forward port, hence no need to backport. 
Closes scylladb/scylladb#22039 * github.com:scylladb/scylladb: build: cmake: add CMake options for PGO support build: cmake: add "Scylla_ENABLE_LTO" option build: set LTO and PGO flags for Seastar in cmake build build: collect scylla libraries with `scylla_libs` variable build: Unify Abseil CXX flags configuration configure.py: prepare the build for a default PGO profile in version control configure.py: introduce profile-guided optimization pgo: add alternator workloads training pgo: add a repair workload pgo: add a counters workload pgo: add a secondary index workload pgo: add a LWT workload pgo: add a decommission workload pgo: add a clustering workload pgo: add a basic workload pgo: introduce a PGO training script configure.py: don't include non-default modes in dist-server-* rules configure.py: enable LTO in release builds by default configure.py: introduce link-time optimization configure.py: add a `default` to `add_tristate`. configure.py: unify build rules for cxxbridge .cc files and regular .cc files	2024-12-31 14:14:40 +02:00
Avi Kivity	4905b1bf76	Merge 'table: make update_effective_replication_map sync again' from Benny Halevy Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`). Also, yielding inside the storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, which caused scylladb/scylladb#17786. This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). - storage_service: replicate_to_all_cores: update base and view tables atomically Currently, the loop updating all tables (including views) with the new effective_replication_map may yield, and therefore expose a state where the base and view tables' effective_replication_map and topology are out of sync (as seen in scylladb/scylladb#17786). To prevent that, loop over all base tables and for each table update the base table and all views atomically, without yielding, and so allow yielding only between base tables. * Regression was introduced in `f2ff701489`, so backport is required to 6.x, 2024.2 Closes scylladb/scylladb#21781 * github.com:scylladb/scylladb: storage_service: replicate_to_all_cores: clear_gently pending erms test_mv_topology_change: drop delay_after_erm_update injection case storage_service: replicate_to_all_cores: update base and view tables atomically table: make update_effective_replication_map sync again	2024-12-30 23:42:06 +02:00
Tomasz Grabiec	bf3d0b3543	reader_concurrency_semaphore: Optimize resource_units destruction by postponing wait list processing Observed 3% throughput improvement in sstable-heavy workload bounded by CPU. SSTable parsing involves lots of buffer operations which obtain and destroy resource_units. Before the patch, resource_units destruction invoked maybe_admit_waiters(), which performs some computations on waiting permits. We don't really need to admit on each change of resources, since the CPU is used by other things anyway. We can batch the computation. There is already a fiber which does this for processing the _ready_list. We can reuse it for processing _wait_list as well. The changes violate an assumption made by tests that releasing resources immediately triggers an admission check. Therefore, some of the BOOST_REQUIRE_EQUAL checks need to be replaced with REQUIRE_EVENTUALLY_EQUAL as the admission check is now done in the fiber processing the _ready_list. `perf-simple-query` --tablets --smp 1 -m 1G results obtained for fixed 400MHz frequency: Before: ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
112590.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41353 insns/op, 17992 cycles/op, 0 errors) 122620.68 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41310 insns/op, 17713 cycles/op, 0 errors) 118169.48 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41353 insns/op, 17857 cycles/op, 0 errors) 120634.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41328 insns/op, 17733 cycles/op, 0 errors) 117317.18 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41347 insns/op, 17822 cycles/op, 0 errors) throughput: mean=118266.52 standard-deviation=3797.81 median=118169.48 median-absolute-deviation=2368.13 maximum=122620.68 minimum=112590.60 instructions_per_op: mean=41337.86 standard-deviation=18.73 median=41346.89 median-absolute-deviation=14.64 maximum=41352.53 minimum=41309.83 cpu_cycles_per_op: mean=17823.50 standard-deviation=111.75 median=17821.97 median-absolute-deviation=90.45 maximum=17992.04 minimum=17713.00 ``` After ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
123689.63 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40997 insns/op, 17384 cycles/op, 0 errors) 129643.24 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40997 insns/op, 17325 cycles/op, 0 errors) 128907.27 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41009 insns/op, 17325 cycles/op, 0 errors) 130342.56 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40993 insns/op, 17286 cycles/op, 0 errors) 130294.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40972 insns/op, 17336 cycles/op, 0 errors) throughput: mean=128575.36 standard-deviation=2792.75 median=129643.24 median-absolute-deviation=1718.73 maximum=130342.56 minimum=123689.63 instructions_per_op: mean=40993.51 standard-deviation=13.23 median=40996.73 median-absolute-deviation=3.30 maximum=41008.86 minimum=40972.48 cpu_cycles_per_op: mean=17331.16 standard-deviation=35.02 median=17324.84 median-absolute-deviation=6.49 maximum=17383.97 minimum=17286.33 ``` Closes scylladb/scylladb#21918 [avi: patch was co-authored by Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>]	2024-12-30 23:37:46 +02:00
Michael Litvak	5ef7afb968	cql3: allow SELECT of specific collection key This adds to the grammar the option to SELECT a specific key in a collection column using subscript syntax. For example: SELECT map['key'] FROM table SELECT map['key1']['key2'] FROM table The key can also be parameterized in a prepared query. For this we need to pass the query options to result_set_builder where we process the selectors. Fixes scylladb/scylladb#7751	2024-12-30 17:05:20 +02:00
Piotr Smaron	2352063f20	server: set `connection_stage` to READY when authenticated If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice only happens on only one of a driver's connections — because there's no point listening for the same events on multiple connections), connections are wrongly displayed in the system.clients as AUTHENTICATING instead of READY, even when they are ready. This commit fixes this problem. Fixes: scylladb/scylladb#12640 Closes scylladb/scylladb#21774	2024-12-30 14:04:26 +02:00
Kefu Chai	6281fb825f	test/pytest.ini: ignore warning on deprecated record_property fixture `record_property` generates XML which is not compatible with xunit2, so pytest decided to deprecate it when generating xunit reports, and pytest generates the following warning when a test failure is reported using this fixture: ``` object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1') ``` This warning is not related to the test, but rather to how we report a failure using pytest. It is distracting, so let's silence it. See also https://github.com/pytest-dev/pytest/issues/5202 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22067	2024-12-30 10:58:31 +02:00
Nadav Har'El	27180620af	Merge 'topology_random_failures: deselect more cases which can cause #21534 ' from Evgeniy Naydanov There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which are caused by the `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events. We need to deselect them for now to make CI more stable. The first batch was deselected in https://github.com/scylladb/scylladb/pull/21658 Also, add the handling of topology state rollback caused by the `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit. See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957 Closes scylladb/scylladb#22044 * github.com:scylladb/scylladb: test.py: topology_random_failures: more deselects for #21534 test.py: topology_random_failures: handle more node's hangs during 30s sleep	2024-12-30 10:52:22 +02:00
Michał Chojnowski	fdb2d2209c	messaging_service: use advanced_rpc_compression::tracker for compression This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`, `dict_sampler` and `dictionary_service` in `main()`, and wires them to each other and to `messaging_service`. `messaging_service` compresses its network traffic with compressors managed by the `advanced_rpc_compression::tracker`. All this traffic is passed as a single merged "stream" through `dict_sampler`. `dictionary_service` has access to `dict_sampler`. On chosen nodes (by default: the Raft leader), it uses the sampler to maintain a random multi-megabyte sample of the sampler's stream. Every several minutes, it copies the sample, trains a compression dictionary on it (by calling zstd's training library via the `alien_worker` thread) and publishes the new dictionary to `system.dicts` via Raft. This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes, which updates the dictionary used by the compressors it manages.	2024-12-27 10:17:58 +01:00
Kefu Chai	4154789670	build: cmake: add "Scylla_ENABLE_LTO" option add an option named "Scylla_ENABLE_LTO", which is off by default. if it is on, build the whole tree with ThinLTO enabled. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 16:16:04 +08:00
Kefu Chai	6acc5294a4	treewide: migrate from boost::copy_range to std::ranges::to now that we are allowed to use C++23. we now have the luxury of using `std::ranges::to`. in this change, we: - replace `boost::copy_range` to `std::ranges::to` - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21880	2024-12-26 11:46:26 +02:00
Kefu Chai	6c031ad92f	test/topology: Percent-encode URL in pytest artifact links When embedding HTML documents in pytest reports with links to test artifacts, parameterized test names containing special characters like "[" and "]" can cause URL encoding issues. These characters, when used verbatim in URLs, can trigger HTTP 400 errors on web servers. This commit resolves the issue by percent-encoding the URLs for artifact links, ensuring compatibility with servers like Jenkins and preventing "HTTP ERROR 400 Illegal Path Character" errors. Changes: - Percent-encode test artifact URLs to handle special characters - Improve link robustness for parameterized test names Fixes scylladb/scylla-pkg#4599 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21963	2024-12-26 10:23:52 +02:00
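The percent-encoding fix in 6c031ad92f can be sketched with the standard library. The helper name `artifact_url` and the base URL are hypothetical; only `urllib.parse.quote` is a real API here.

```python
from urllib.parse import quote


def artifact_url(base_url: str, test_name: str) -> str:
    """Build a link to a test artifact, escaping characters such as '[' and ']'."""
    # safe="" also escapes "/", so the test name always stays a single path segment.
    return base_url.rstrip("/") + "/" + quote(test_name, safe="")


url = artifact_url("https://jenkins.example/artifacts/", "test_backup[rf=3-scope=rack]")
```

The brackets and the equals signs of the parameterized test name are emitted as `%5B`, `%5D` and `%3D`, which avoids the "Illegal Path Character" rejection described in the commit message.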
Konstantin Osipov	d87e1eb7ef	test: merge topology_experimental_raft into topology_custom This enables tablets in topology_custom, so explicitly disable them where tests don't support tablets. As part of this rename, patch a few imports. Importing dependencies from another test is a bad idea - please use shared libraries instead. Fixed #20193 Closes scylladb/scylladb#22014	2024-12-26 00:33:08 +02:00
Avi Kivity	465449e4a1	test: combined_test: relicense Was inadvertently released under the AGPL.	2024-12-25 13:53:54 +02:00
Avi Kivity	3ffe93b6ae	Merge 'Enhance load-and-stream with "scope"' from Pavel Emelyanov The main purpose of this change is to enhance the restore from object storage usage. Currently, restore uses the load-and-stream facility. When triggered, the restoring task opens the provided list of sstables directory from the remote bucket and then feeds the list of sstables to load_and_stream() method. The method, in turn, iterates over this list, reads mutations and for each mutation decides where to send one by checking the replication map (it's pretty much the same for both vnodes and tablets, but for tablets that are "fully contained" by a range there's the plan to stream faster). As described above, restore is governed by a single node and this single node reads all sstables from the object store, which can be very slow. This PR allows speeding things up. For that, the load-and-stream code is equipped with the "scope" filter which limits where mutations can be streamed to. There are four options for that -- all, dc, rack and node. The "all" is how things work currently, "dc" and "rack" filter out target nodes that don't belong to this node's dc/rack respectively. The "node" scope only streams mutations to local node. With the "node" scope it's possible to make all nodes in the cluster load mutations that belong to them in parallel, without re-sending them to peers. The last patch in this PR is the test that shows how it can be possible. Closes scylladb/scylladb#21169 * github.com:scylladb/scylladb: test: Add scope-streaming test (for restore from backup) api: New "scope" API param to load-and-stream calls sstables_loader: Propagate scope from API down sstables_loader: Filter tablets based on scope streamer: Disable scoped streaming of primary replica only sstables_loader: Introduce streaming scope sstables_loader: Wrap get_endpoints()	2024-12-25 13:52:51 +02:00
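The four streaming scopes described in 3ffe93b6ae amount to a filter over candidate target nodes. The sketch below is illustrative only: the `Node` type and `filter_targets` function are hypothetical, and comparing the datacenter as well as the rack for the "rack" scope is an assumption (rack names may repeat across datacenters).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    dc: str
    rack: str


def filter_targets(local: Node, replicas: list[Node], scope: str) -> list[Node]:
    """Keep only the replicas that the given streaming scope allows us to stream to."""
    if scope == "all":
        return list(replicas)                      # current behaviour: no filtering
    if scope == "dc":
        return [n for n in replicas if n.dc == local.dc]
    if scope == "rack":
        return [n for n in replicas if n.dc == local.dc and n.rack == local.rack]
    if scope == "node":
        return [n for n in replicas if n == local]  # only load what belongs to this node
    raise ValueError(f"unknown scope: {scope!r}")


n1 = Node("n1", "dc1", "r1")
n2 = Node("n2", "dc1", "r2")
n3 = Node("n3", "dc2", "r1")
targets = filter_targets(n1, [n1, n2, n3], "dc")
```

With the "node" scope every node drops mutations it does not own, so a restore can run on all nodes in parallel without any node re-sending data to its peers.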
Nadav Har'El	23213e8696	Merge 'Make get_built_indexes REST API endpoint be consistent with system."IndexInfo" table' from Pavel Emelyanov It turned out that aforementioned APIs use slightly different sources of information about view build progress/status which sometimes results in different reporting of whether an index is built. It's good to make those two APIs consistent. Also add a test for the REST API endpoint (system table test was addressed by #21677). Closes scylladb/scylladb#21814 * github.com:scylladb/scylladb: test: Add tests for MVs and indexes reporting by API endpoint(s) api: Use built_views table in get_built_indexes API	2024-12-25 11:47:03 +02:00
Evgeniy Naydanov	5992e8b031	test.py: topology_random_failures: more deselects for #21534 More cases found which can cause the same 'local_is_initialized()' assertion during the node's bootstrap.	2024-12-25 06:38:13 +00:00
Evgeniy Naydanov	f337ecbafa	test.py: topology_random_failures: handle more node's hangs during 30s sleep The node is hanging and the coordinator just rolls back the topology state. It's different from `stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration` because in these cases the coordinator is just not able to start the topology change at all and the message in the coordinator's log is different. Error injections handled: - `stop_after_updating_cdc_generation` - `stop_before_streaming` And, actually, it can be any cluster event which lasts more than 30s.	2024-12-25 06:38:13 +00:00
Pavel Emelyanov	644d36996d	test: Add tests for MVs and indexes reporting by API endpoint(s) So far there's the /column_family/built_indexes one that reports the index names similar to how system.IndexInfo does, but it's not tested. This patch adds tests next to existing system. table ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-24 16:18:32 +03:00
Benny Halevy	d1490bb7bf	locator/topology: do_sort_by_proximity: shuffle equal-distance replicas To improve balancing when reading in 1 < CL < ALL This implementation has a moderate impact on the function performance in contrast to full std::shuffle of the vector before stable_sort:ing it (especially with large number of nodes to sort). Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6 After: sort_by_proximity_topology.perf_sort_by_proximity 19689561 50.195ns 0.119ns 50.076ns 51.145ns 0.000 0.000 622.5 150.6 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 13:00:17 +02:00
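The idea in d1490bb7bf, ordering replicas by distance while randomizing the order among replicas at the same distance, can be sketched as follows. This is a simplified illustration rather than the C++ implementation: it sorts first and then shuffles each equal-distance group, and the function signature is an assumption.

```python
import random
from itertools import groupby


def sort_by_proximity(replicas, distance, rng=random):
    """Sort replicas by distance, shuffling replicas that are equally distant."""
    # Compute each distance exactly once and sort on the precomputed value.
    keyed = sorted(((distance(r), r) for r in replicas), key=lambda pair: pair[0])
    result = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        bucket = [replica for _, replica in group]
        rng.shuffle(bucket)        # spread load between equal-distance replicas
        result.extend(bucket)
    return result


dist = {"local": 0, "same_rack_a": 1, "same_rack_b": 1, "remote_dc": 2}
ordered = sort_by_proximity(list(dist), dist.__getitem__, random.Random(1))
```

The closest replica always comes first and the farthest last, while the two same-rack replicas may appear in either order, which is what balances reads when only some of the replicas are contacted.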
Benny Halevy	0fe8bdd0db	locator/topology: sort_by_proximity: calculate distance only once And use a temporary vector to use the precalculated distances. A later patch will add some randomization to shuffle nodes at the same distance from the reference node. This improves the function performance by 50% for 3 replicas, from 77.4 ns to 39.2 ns, larger replica sets show greater improvement (over 4X for 15 nodes): Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6 After: sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 12:27:03 +02:00
Benny Halevy	75da99ce8b	test/perf: add perf_sort_by_proximity benchmark benchmark sort_by_proximity Baseline results on my desktop for sorting 3 nodes: single run iterations: 0 single run duration: 1.000s number of runs: 5 number of cores: 1 random seed: 20241224 test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 12:18:24 +02:00
Michał Chojnowski	0fd1050784	utils: add advanced_rpc_compressor Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries as the network traffic compressors for Seastar's RPC servers. The main jobs of this glue are: 1. Implementing the API expected by Seastar from RPC compressors. 2. Expose metrics about the effectiveness of the compression. 3. Allow dynamically switching algorithms and dictionaries on a running connection, without any extra waits. The biggest design decision here is that the choice of algorithm and dictionary is negotiated by both sides of the connection, not dictated unilaterally by the sender. The negotiation algorithm is fairly complicated (a TLA+ model validating it is included in the commit). Unilateral compression choice would be much simpler. However, negotiation avoids re-sending the same dictionary over every connection in the cluster after dictionary updates (with one-way communication, it's the only reliable way to ensure that our receiver possesses the dictionary we are about to start using), lets receivers ask for a cheaper compression mode if they want, and lets them refuse to update a dictionary if they don't think they have enough free memory for that. In hindsight, those properties probably weren't worth the extra complexity and extra development effort. Zstd can be quite expensive, so this patch also includes a mechanism which temporarily downgrades the compressor from zstd to lz4 if zstd has been using too much CPU in a given slice of time. But it should be noted that this can't be treated as a reliable "protection" from negative performance effects of zstd, since a downgrade can happen on the sender side, and receivers are at the mercy of senders.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	5294762ac7	utils: add dict_trainer	2024-12-23 23:37:02 +01:00
Michał Chojnowski	9de52b1c98	utils: introduce reservoir_sampling We are planning to improve some usages of compression in Scylla (in which we compress small blocks of data) by pre-training compression dictionaries on similar data seen so far. For example, many RPC messages have similar structure (and likely similar data), so the similarity could be exploited for better compression. This can be achieved e.g. by training a dictionary on the RPC traffic, and compressing subsequent RPC messages against that dictionary. To work well, the training should be fed a representative sample of the compressible data. Such a sample can be approached by taking a random subset (of some given reasonable size) of the data, with uniform probability. For our purposes, we need an online algorithm for this -- one which can select the random k-subset from a stream of arbitrary size (e.g. all RPC traffic over an hour), while requiring only the necessary minimum of memory. This is a known problem, called "reservoir sampling". This PR introduces `reservoir_sampler`, which implements an optimal algorithm for reservoir sampling. Additionally, it introduces `page_sampler` -- a wrapper for `reservoir_sampler`, which uses it to select a random sample of pages from a stream of bytes.	2024-12-23 23:37:02 +01:00
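For readers unfamiliar with reservoir sampling (9de52b1c98), here is a minimal sketch using the classic "Algorithm R". It is not necessarily the algorithm used by `reservoir_sampler`, which the commit describes as optimal; Algorithm R draws one random number per item and is shown only because it is the simplest correct variant.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random k-subset of a stream of unknown length, in O(k) memory."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)      # uniform in [0, i], both ends inclusive
            if j < k:
                sample[j] = item       # item i replaces a slot with probability k / (i + 1)
    return sample


picked = reservoir_sample(range(1000), 10, random.Random(42))
```

Each item of the stream ends up in the final sample with the same probability, regardless of how long the stream turns out to be, which is the property needed for training a dictionary on a representative sample of traffic.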
Michał Chojnowski	866326efe4	utils: add stream_compressor Adds utilities for "advanced" methods of compression with lz4 and zstd -- with streaming (a history buffer persisted across messages) and/or precomputed dictionaries. This patch is mostly just glue needed to use the underlying libraries with discontiguous input and output buffers, and for reusing the same compressor context objects across messages. It doesn't contain any innovations of its own. There is one "design decision" in the patch. The block format of LZ4 doesn't contain the length of the compressed blocks. At decompression time, that length must be delivered to the decompressor by a channel separate to the compressed block itself. In `lz4_cstream`, we deal with that by prepending a variable-length integer containing the compressed size to each compressed block. This is suboptimal for single-fragment messages, since the user of lz4_cstream is likely going to remember the length of the whole message anyway, which makes the length prepended to the block redundant. But a loss of 1 byte is probably acceptable for most uses.	2024-12-23 23:28:12 +01:00
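The length-prefix framing described in 866326efe4 can be sketched as below. The commit only says "a variable-length integer"; the LEB128-style encoding (7 payload bits per byte, high bit as continuation flag) and all function names are assumptions for illustration.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)    # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position after the varint)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame(block: bytes) -> bytes:
    # Prepend the compressed size so the decompressor knows where the block ends.
    return encode_varint(len(block)) + block


def unframe(buf: bytes, pos: int = 0) -> tuple[bytes, int]:
    length, pos = decode_varint(buf, pos)
    return buf[pos:pos + length], pos + length
```

Under this encoding, any block shorter than 128 bytes costs exactly one extra byte, which matches the "loss of 1 byte" trade-off the commit message accepts for single-fragment messages.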
Pavel Emelyanov	972ff80fad	test: Add scope-streaming test (for restore from backup) - create - a cluster with given topology - keyspace with tablets and given rf value - table with some data - backup - flush all nodes - kick backup API on every node - re-create keyspace and table - drop it first - create again with the same parameters and schema, but don't populate table with data - restore - collect nodes to contact and corresponding list of TOCs according to the preferred "scope" - ask selected nodes to restore, limiting its streaming scope and providing the specific list of sstables - check - select mutation fragments from all nodes for random keys - make sure that the number of non-empty responses equals the expected rf value Specific topologies, RFs and stream scopes used are: rf = 1, nodes = 3, racks = 1, dcs = 1, scope = node rf = 3, nodes = 5, racks = 1, dcs = 1, scope = node rf = 1, nodes = 4, racks = 2, dcs = 1, scope = rack rf = 3, nodes = 6, racks = 2, dcs = 1, scope = rack rf = 3, nodes = 6, racks = 3, dcs = 1, scope = rack rf = 2, nodes = 8, racks = 4, dcs = 2, scope = dc nodes and racks are evenly distributed in racks and dcs respectively in the last topo RF effectively becomes 4 (2 in each dc) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:28:05 +03:00
Pavel Emelyanov	a24dc02255	api: New "scope" API param to load-and-stream calls There are two of those -- the POST /storage_service/keyspace that loads and streams new sstables from /upload and POST /storage_service/restore that does the same, but gets sstables from object store. The new optional parameter allows users to tune the streaming phase behavior. The test/pylib client part is also updated here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:28:05 +03:00
Benny Halevy	68b0b442fd	locator: refactor sort_by_proximity Extract can_sort_by_proximity() out so it can be used later by storage_proxy, and introduce do_sort_by_proximity that sorts unconditionally. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-23 16:42:55 +02:00
Takuya ASADA	03461d6a54	test: compile unit tests into a single executable To reduce test executable size and speed up compilation time, compile unit tests into a single executable. Here is a file size comparison of the unit test executable: - Before applying the patch $ du -h --exclude='.o' --exclude='.o.d' build/release/test/boost/ build/debug/test/boost/ 11G build/release/test/boost/ 29G build/debug/test/boost/ - After applying the patch $ du -h --exclude='.o' --exclude='.o.d' build/release/test/boost/ build/debug/test/boost/ 5.5G build/release/test/boost/ 19G build/debug/test/boost/ This reduces the executable size by 5.5GB on release and by 10GB on debug. Closes #9155 Closes scylladb/scylladb#21443	2024-12-22 19:14:09 +02:00
Avi Kivity	f8ce49ebe9	cql3: implement NOT IN Where the grammar supports IN, we add NOT IN. This includes the WHERE clause and LWT IF clause. Evaluation of NOT IN follows from IN. In statement_restrictions analysis, they are different, as NOT IN doesn't enable any clever query plan and must filter. Some tests are added. An error message was changed ('in' changed to 'IN'), so some tests are adjusted. Closes scylladb/scylladb#21992	2024-12-22 15:15:23 +02:00
Kefu Chai	10c79a4d47	test/pylib: do not check for self.cmd when tearing down ScyllaServer we already check `self.cmd` for null at the very beginning of the `ScyllaServer.stop()`, and in the `try` block, we don't reset `self.cmd`, hence there is no need to check it again. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21936	2024-12-20 16:21:40 +02:00
Avi Kivity	eb62593f2c	treewide: use angle brackets when including seastar headers We treat Seastar as a "system" library, and those are included with angle brackets. Closes scylladb/scylladb#21959	2024-12-20 16:16:28 +02:00
Aleksandra Martyniuk	1c29726477	replica: do not set tablet_task_info if it isn't valid Currently, in tablet_map_to_mutation, repair's and migration's tablet_task_info is always set. Do not set the tablet_task_info if there is no running operation. Closes scylladb/scylladb#22005	2024-12-20 16:10:53 +02:00
Kefu Chai	2a9f34bb85	test/pytest.ini: put `repair` marker declaration back During the consolidation of per-suite pytest.ini files (commit `8bf62a086f`), the 'repair' marker was inadvertently dropped. This led to pytest warnings for tests using the @pytest.mark.repair decorator. This patch restores the marker declaration to eliminate the distracting PytestUnknownMarkWarning: ``` test/topology_experimental_raft/test_tablets.py:396 /home/kefu/dev/scylladb/test/topology_experimental_raft/test_tablets.py:396: PytestUnknownMarkWarning: Unknown pytest.mark.repair - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html @pytest.mark.repair ``` Restoring the marker allows tests to use the 'repair' mark without generating warnings. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21931	2024-12-20 14:04:50 +02:00
Botond Dénes	42d24b2a8a	Merge 'Retire topology::sort_by_proximity and compare_endpoints flavors using gms::inet_address' from Benny Halevy This series converts the call site using compare_endpoints with gms::inet_address. With that both flavors of compare_endpoints and sort_by_proximity for inet_address can be retired as no other uses remain. Also, add a unit test for topology::sort_by_proximity before further changes to it are considered. * Code cleanup, no backport is needed Closes scylladb/scylladb#21976 * github.com:scylladb/scylladb: test: network_topology_strategy_test: add test_topology_sort_by_proximity locator/topology: retire sort_by_proximity/compare_endpoints for inet_address test: test_topology_compare_endpoints: use host_id:s	2024-12-20 13:34:55 +02:00
Kefu Chai	24283d9dd0	test/topology: rename manager_internal to manager_client instead of reusing the variable name and overriding the parameter, use a new name for the return value of `manager_internal()` for better readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21932	2024-12-20 13:01:45 +02:00
Botond Dénes	d4129ddaa6	Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar When an sstable is unlinked, it remains in the _active list of the sstable manager. Its memory might be reclaimed and later reloaded, causing issues since the sstable is already unlinked. This patch updates the on_unlink method to reclaim memory from the sstable upon unlinking, remove it from memory tracking, and thereby prevent the issues described above. Added a testcase to verify the fix. Fixes #21887 This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions. Closes scylladb/scylladb#21895 * github.com:scylladb/scylladb: sstables_manager: reclaim memory from sstables on unlink sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable() sstables: introduce disable_component_memory_reload() sstables_manager: log sstable name when reclaiming components	2024-12-19 15:18:16 +02:00
Michał Chojnowski	f6ebd445e4	test_tablets.py: limit concurrency in test_tablet_storage_freeing Apparently the python driver can't deal with the current concurrency sometimes. Lower it from 1000 to 100. Fixes scylladb/scylladb#20489 Closes scylladb/scylladb#20494	2024-12-19 15:14:41 +02:00
Pavel Emelyanov	bb094cc099	Merge 'Make restore task abortable' from Calle Wilund Fixes #20717 Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data. Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()". Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader. There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored. Closes scylladb/scylladb#21567 * github.com:scylladb/scylladb: test_backup: Add restore abort test case sstables_loader: Make restore task abortable distributed_loader: Add optional abort_source to get_sstables_from_object_store s3_storage: Add optional abort_source to params/object s3::client: Make "readable_file" abortable	2024-12-19 12:23:33 +03:00
Benny Halevy	67b7015ced	test: network_topology_strategy_test: add test_topology_sort_by_proximity Before further changes are made to sort_by_proximity add a unit test for it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-19 09:45:02 +02:00


8039 Commits