scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-02 04:56:58 +00:00

Author	SHA1	Message	Date
Botond Dénes	1ab697693f	Merge 'compaction/twcs: fix use after free issues' from Lakshmi Narayanan Sreethar The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives(`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. The method `maybe_wait_for_sstable_count_reduction()`, when retrieving the list of sstables for a possible compaction, holds a reference to the compaction strategy. If the strategy is updated during execution, it can cause a use after free issue. To prevent this, hold a copy of the compaction strategy so it isn’t yanked away during the method’s execution. Fixes #25913 Issue probably started after `9d3755f276`, so backport to 2025.4 Closes scylladb/scylladb#26593 * github.com:scylladb/scylladb: compaction: fix use after free when strategy is altered during compaction compaction/twcs: pass compaction_strategy_state to internal methods compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction	2025-10-20 10:45:47 +03:00
Lakshmi Narayanan Sreethar	18c071c94b	compaction: fix use after free when strategy is altered during compaction The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives(`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. Fixes #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 22:57:05 +05:30
Pavel Emelyanov	068d788084	transport: Don't use scattered_message The API to put scattered_message into output_stream() is gone in seastar API level 9, transport is the only place in Scylla that still uses it. The change is to put the response as a sequence of temporary_buffer-s. This preserves the zero-copy-ness of the reply, but needs a few things taken care of. First, the response header frame needs to be put as a zero-copy buffer too. Although output_stream() supports a semi-mixed mode, where zero-copy buffers can follow the buffered writes, it won't apply here. The socket is flushed in batched mode, so even if the first reply populates the stream with data and flushes it, the next response may happen to start putting the header frame before the delayed flush took place. Second, because the socket is flushed in the batch-flush poller, the temporary buffers that are put into it must hold the foreign_ptr with the response object. With scattered message this was implemented with the help of a deleter that was attached to the message, now the deleter is shared between all buffers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:17:08 +03:00
Tomasz Grabiec	c4a87453a2	Merge 'Add experimental feature flag for strongly consistent tables and extend keyspace creation syntax to allow specifying consistency mode.' from Gleb Natapov The series adds an experimental flag for strongly consistent tables and extends the "CREATE KEYSPACE" DDL with a `consistency` option that allows specifying the consistency mode for the keyspace. Closes scylladb/scylladb#26116 * github.com:scylladb/scylladb: schema: Allow configuring consistency setting for a keyspace db: experimental consistent-tablets option	2025-10-16 21:48:06 +02:00
Emil Maskovsky	6769c313c2	raft: small fixes for voters code Minor cleanups and improvements to voter-related code. No backport: cleanup only, no functional changes. Closes scylladb/scylladb#26559	2025-10-16 18:41:08 +02:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable, while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities, the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. A non-eventual setting is supported for tablets-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
Avi Kivity	8f1de2a7ad	Merge 'test/boost: speed up test test_indexing_paging_and_aggregation by making internal page size configurable' from Nadav Har'El The C++ test `test_indexing_paging_and_aggregation` is one of the slowest tests in test/boost. The reason for its slowness is that it needs a table with more rows than SELECT's "DEFAULT_COUNT_PAGE_SIZE" which was hard-coded to 10,000, so the test needed to write and read tens of thousands of rows, and did it multiple times. It turns out the code actually had an ad-hoc mechanism to override DEFAULT_COUNT_PAGE_SIZE in a C++ test, but both this mechanism and the test itself were so opaque I didn't find it until I fixed it in a different way: What I ended up doing in this pull request is the following (each step in a separate patch): 1. Rewrite this test in Python, in the test/cqlpy framework. This was straightforward, as this test only used CQL and not internal interfaces. The reason why this test wasn't written in Python in the first place is that it was written in 2019, a year before cqlpy existed. I added extensive comments to the new tests, and I finally understood what it was doing :-) 2. I replaced the ad-hoc C++-test-only mechanism of overriding DEFAULT_COUNT_PAGE_SIZE by a bona-fide configuration parameter, `select_internal_page_size`. 3. Finally, the Python test can temporarily lower `select_internal_page_size` and use a table with much fewer rows. After this series, the test `test_indexing_paging_and_aggregation` (which is now in Python instead of C++) takes around half a second, 20 times faster than before. I expect the speedup to be even more dramatic for the debug build. Closes scylladb/scylladb#25368 * github.com:scylladb/scylladb: cql: make SELECT's "internal page size" configurable secondary index: translate test_indexing_paging_and_aggregation to Python	2025-10-16 11:58:13 +03:00
Botond Dénes	cb27c3d6e9	tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc Currently, to disable tombstone-gc on-demand completely, one has to pass down a bool flag along with the already required tombstone_gc_state to the code which does the compacting. This is redundant and confusing, the tombstone_gc_state is supposed to encapsulate all tombstone-gc related logic in a transparent way. Add dedicated factory methods for no-gc and gc-all, to allow creating a tombstone_gc_state which transparently gcs for all or no tombstones.	2025-10-16 10:38:47 +03:00
Nadav Har'El	921d07a26b	cql: make SELECT's "internal page size" configurable In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or secondary index, it needs to perform internal scans. It uses an "internal page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000. There was an ad-hoc and undocumented way to override this default in C++ tests, using functions in test/lib/select_statement_utils.hh, but it was so non-obvious that the test that most needed to override this default - the very slow test test_indexing_paging_and_aggregation which would have been much faster with a lower setting - never used it. So in this patch we replace the ad-hoc configuration functions by a bona-fide Scylla configuration option named "select_internal_page_size". The few C++ tests that used the old configuration functions were modified to use the new configuration parameters. The slow test test_indexing_paging_and_aggregation still doesn't use the new configuration to become faster - we'll do this in the next patch. Another benefit of having this "internal page size" as a configuration option is that one day a user might realize that the default choice 10,000 is bad for some reason (which I can't envision right now), so having it configurable might come in handy. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 18:42:09 +03:00
Nadav Har'El	afc5379148	secondary index: translate test_indexing_paging_and_aggregation to Python The Boost test test_indexing_paging_and_aggregation is one of the slowest boost tests. But it's hard to understand why it needs to be so slow - the C++ test code is opaque, and uncommented. The test didn't need to be in C++ - it only uses CQL, not any internal interfaces - but it was written in 2019, a year before test/cqlpy was created. So before we can make this test faster, this patch translates it to Python and adds a significant amount of comments. The new Python test is functionally identical to the old C++ test - it is not (yet) made smaller or faster. The new test takes a whopping 9 seconds to run on my laptop (in dev build mode). We'll reduce that in the next patch. As usual, the cqlpy test can also be tested on Cassandra, and unsurprisingly, it passes. Refs #16134 (which asks to translate more MV and SI tests to Python). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 17:50:37 +03:00
Pavel Emelyanov	7bd50437ff	test: Remove unused operator<<(radix_tree_test::test_data) It was used while debugging the test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26458	2025-10-15 11:57:56 +02:00
Calle Wilund	5e4e5b1f4a	sstables::object_storage_client: Add multi-upload support for GS Uses file splitting + object merge to facilitate parallel, resumable upload of files with known size.	2025-10-13 08:53:27 +00:00
Calle Wilund	bd1304972c	utils::gcp::storage: Add merge objects operation Allows merging 1-32 smaller files into a destination.	2025-10-13 08:53:27 +00:00
Calle Wilund	da36a9d78e	boost::gcs_storage_test: reindent Remove redundant indentation/moosewings.	2025-10-13 08:53:27 +00:00
Calle Wilund	1356f60c69	boost::gcs_storage_test: Convert to use fixture Instead of test-local server/endpoint etc, use the gcs test fixture, with the added bonus of a suite-shared one for additional speed.	2025-10-13 08:53:27 +00:00
Calle Wilund	7c6b4bed97	tests::boost: Add GS object storage cases to mirror S3 ones I.e. run same remote storage backend unit tests for GS backend	2025-10-13 08:53:27 +00:00
Calle Wilund	78d9dda060	config: break out object_storage_endpoint_param preparing for multi storage Moves the config wrapper to own file (to reduce recompilation for modifying) and refactors to handle extending this parameter to non-s3 endpoint configs.	2025-10-13 08:53:24 +00:00
Botond Dénes	24c6476f73	mutation/mutation_compactor: add tombstone_gc_state to query ctor So tombstones can be purged correctly based on the tombstone gc mode. Currently if repair-mode is used, tombstones are not purged at all, which can lead to purged tombstones being re-replicated to replicas which already purged them via read-repair. This is not a correctness problem, tombstones are not included in data query result or digest, these purgeable tombstones are only a nuisance for read repair, where they can create extra differences between replicas. Note that for the read repair to trigger, some difference other than in purgeable tombstones has to exist, because as mentioned above, these are not included in digests. Fixes: scylladb/scylladb#24332 Closes scylladb/scylladb#26351	2025-10-12 17:48:15 +03:00
Michał Chojnowski	7c6e84e2ec	test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test It turns out that Boost assertions are thread-unsafe, (and can't be used from multiple threads concurrently). This causes the test to fail with cryptic log corruptions sometimes. Fix that by switching to thread-safe checks. Fixes scylladb/scylladb#24982 Closes scylladb/scylladb#26472	2025-10-12 17:16:51 +03:00
Andrzej Jackowski	14081d0727	generic_server: transport: start using `sl:driver` for new connections Before this change, new connections were handled in a default scheduling group (`main`), because before the user is authenticated we do not know which service level should be used. With the new `sl:driver` service level, creation of new connections can be moved to `sl:driver`. We switch the service level as early as possible, in `do_accepts`. There is a possibility, that `sl:driver` will not exist yet, for instance, in specific upgrade cases, or if it was removed. Therefore, we also switch to `sl:driver` after a connection is accepted. Refs: scylladb/scylladb#24411	2025-10-08 08:25:12 +02:00
Andrzej Jackowski	c59a7db1c9	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-10-08 08:24:43 +02:00
Piotr Dulikowski	380f243986	Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. 
Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Pavel Emelyanov	6ad8dc4a44	Merge 'root,replica: mv querier to replica/' from Botond Dénes The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people. The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear. Code cleanup, no backport. Closes scylladb/scylladb#26280 * github.com:scylladb/scylladb: replica: move querier code to replica namespace root,replica: mv querier to replica/	2025-10-06 08:26:05 +03:00
Michał Chojnowski	6efb807c1a	sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db TemporaryHashes.db is a temporary sstable component used during ms sstable writes. It's different from other sstable components in that it's not included in the TOC. Because of this, it has a special case in the logic that deletes unfinished sstables on boot. (After Scylla dies in the middle of a sstable write). But there's a bug in that special case, which causes Scylla to forget to delete other components from the same unfinished sstable. The code intends only to delete the TemporaryHashes.db file from the `_state->generations_found` multimap, but it accidentally also deletes the file's sibling components from the multimap. Fix that. Fixes scylladb/scylladb#26393	2025-10-04 00:45:55 +02:00
Michał Chojnowski	16cb223d7f	test/boost/database_test: fix two no-op distributed loader tests There are two tests which effectively check nothing. They intend to check that distributed loader removes "leftover" sstable files. So they create some incomplete sstables, run the test env on the directory, and check that the files disappeared. But the test env completely clears the test directory before the distributed loader looks at the files, so the tests succeed trivially. Fix that by adding a config knob to the test env which instructs it not to clear the directory before the test.	2025-10-04 00:44:49 +02:00
Michał Hudobski	e8fb745965	test: add tests for VECTOR_SEARCH_INDEXING permission This commit adds tests to verify the expected behavior of the VECTOR_SEARCH_INDEXING permission, that is, allowing GRANTing this permission only on ALL KEYSPACES and allowing SELECT queries only on tables with vector indexes when the user has this permission	2025-10-03 16:55:57 +02:00
Tomasz Grabiec	9ebdeb261f	tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count The old logic assumes that replicas are spread across whole DC when determining how many tablets we need to have at least 10 tablets per shard. If replicas are actually confined to a subset of racks, that will come up with a too high count and overshoot actual per-shard count in this rack. Similar problem happens for scaling-down of tablet count, when we try to keep per-shard tablet count below the goal. It should be tracked per-rack rather than per-DC, since racks can differ in how loaded they are by RF if it's a rack-list.	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	85ddb832b4	test: tablets: Add test for replica allocation on rack list changes	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	5fc617ecf5	test: Add tests for rack list RF	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	6de342ed3e	locator: network_topology_strategy: Respect rack list when reallocating tablets	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	726548b835	locator: Abstract obtaining the number of replicas from replication_strategy_config_option It will become more complex when options will contain rack lists. It's a good change regardless, as it reduces duplication and makes parsing uniform. We already diverged to use stoi / stol / stoul. The change in create_keyspace_statement.cc to add a catch clause is needed because get_replication_factor() now throws configuration_exception on parsing errors instead of std::invalid_argument, so the existing catch clause in the outer scope is not effective. That loop is trying to interpret all options as RF to run some validations. Not all options are RF, and those are supposed to be ignored.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	91e51a5dd1	cql3, locator: Use type aliases for option maps In preparation for changing their structure. 1) std::map<sstring, sstring> -> replication_strategy_config_options Parsed options. Values will become std::variant<sstring, rack_list> 2) std::map<sstring, sstring> -> property_definitions::map_type Flattened map of options, as stored in system tables.	2025-10-01 16:06:51 +02:00
Benny Halevy	da6e2fdb1b	locator: Pass topology to replication strategy constructor	2025-10-01 16:06:28 +02:00
Avi Kivity	15fa1c1c7e	Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski Applying lazy evaluation to the BTI encoding of clustering keys was probably a bad default. The possible benefits are dubious (because it's quite likely that the laziness won't allow us to avoid that much work), but the overhead needed to implement the laziness is large and immediate. In this patch we get rid of the laziness. We rewrite lazy_comparable_bytes_from_clustering_position and lazy_comparable_bytes_from_ring_position so that they perform the key translation eagerly, all components to a single bytes_ostream in one synchronous call. perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type): ``` Before: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6 After: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9 ``` Enhancement, backport not required. Closes scylladb/scylladb#26302 * github.com:scylladb/scylladb: sstables/trie: BTI-translate the entire partition key at once sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset() sstables/trie: perform the BTI-encoding of position_in_partition eagerly types/comparable_bytes: add comparable_bytes_from_compound test/perf: add perf_bti_key_translation	2025-10-01 14:59:06 +03:00
Botond Dénes	bdca5600ef	Merge 'Prevent stalls due to large tablet mutations' from Benny Halevy Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet. With enough tablets (10s of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls also when converting those mutations into canonical_mutation and back, as they are similar to frozen_mutation, and a bit more expensive since they also save the column mappings. This series takes a different approach than allowing freeze to yield. `tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations, that when squashed together are equivalent to the previously large mutation. Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutations to a vector for further processing, and/or process them inline by freezing or making a canonical mutation. In addition, splitting the large mutations also prevents hitting the commitlog maximum mutation size. Closes scylladb/scylladb#18162 * github.com:scylladb/scylladb: schema_tables: convert_schema_to_mutations: simplify check for system keyspace tablets: read_tablet_mutations: use unfreeze_and_split_gently storage_service: merge_topology_snapshot: freeze snp.mutations gently mutation: async_utils: add unfreeze_and_split_gently mutation: add for_each_split_mutation tablets: tablet_map_to_mutations: maybe split tablets mutation tablets: tablet_map_to_mutations: accept process_func perf-tablets: change default tables and tablets-per-table perf-tablets: abort on unhandled exception	2025-10-01 07:04:09 +03:00
Benny Halevy	aaddff5211	tablets: tablet_map_to_mutations: accept process_func Prepare for generating several mutations for the tablet_map by calling process_func for each generated mutation. This allows the caller to directly freeze those mutations one at a time into a vector of frozen mutations or similarly convert them into canonical mutations. The next patch will split large tablet mutations to prevent stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:38 +03:00
Nadav Har'El	926089746b	message: move RPC compression from utils/ to message/ The directory utils/ is supposed to contain general-purpose utility classes and functions, which are either already used across the project, or are designed to be used across the project. This patch moves 8 files out of utils/: utils/advanced_rpc_compressor.hh utils/advanced_rpc_compressor.cc utils/advanced_rpc_compressor_protocol.hh utils/stream_compressor.hh utils/stream_compressor.cc utils/dict_trainer.cc utils/dict_trainer.hh utils/shared_dict.hh These 8 files together implement the compression feature of RPC. None of them are used by any other Scylla component (e.g., sstables have a different compression), or are ready to be used by another component, so this patch moves all of them into message/, where RPC is implemented. Theoretically, we may want in the future to use this cluster of classes for some other component, but even then, we shouldn't just have these files individually in utils/ - these are not useful stand-alone utilities. One cannot use "shared_dict.hh" assuming it is some sort of general-purpose shared hash table or something - it is completely specific to compression and zstd, and specifically to its use in those other classes. Beyond moving these 8 files, this patch also contains changes to: 1. Fix includes to the 5 moved header files (.hh). 2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt for the three moved source files (.cc). 3. In the moved files, change from the "utils::" namespace, to the "netw::" namespace used by RPC. Also needed to change a bunch of callers for the new namespace. Also, had to add "utils::" explicitly in several places which previously assumed the current namespace is "utils::". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25149	2025-09-30 17:03:09 +03:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config.

Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions).

The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.

This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset   stride  rows  iterations  avg aio  aio  (KiB)
500000        1     1          70     18.0   18    128
500001        1     1         647     19.0   19    132
     0  1000000     1         748     15.0   15    116
     0   500000     2         372     29.0   29    284
     0   250000     4         227     56.0   56    504
     0   125000     8         116    106.0  106    928
     0    62500    16          67    195.0  195   1732
```

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset   stride  rows  iterations  avg aio  aio  (KiB)
500000        1     1          51      5.1    5     20
500001        1     1          64      5.3    5     20
     0  1000000     1         679      4.0    4     16
     0   500000     2         492      8.0    8     88
     0   250000     4         804     16.0   16    232
     0   125000     8         409     31.0   31    516
     0    62500    16          97     54.0   54   1056
```

Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
External tests: I ran the SCT test `longevity-mv-si-4days-streaming-test` on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Piotr Dulikowski	4581c72430	Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev `SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables. We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now. Fixes https://github.com/scylladb/scylladb/issues/26258 backports: not needed (a new feature) Closes scylladb/scylladb#26284 * github.com:scylladb/scylladb: cql_test_env.cc: log exception when callback throws lwt: prohibit for tablet-based views and cdc logs tablets: disallow chains of colocated tables database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda	2025-09-30 07:15:16 +02:00
Michał Chojnowski	771a82969e	test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes Adds a test for the bloom filter rebuild mechanism in `ms` sstables.	2025-09-29 22:15:26 +02:00
Michał Chojnowski	fe9f5f4da2	sstables: add `ms` to `all_sstable_versions` Add `ms` to the lists of sstable formats. This will cause it to be included in various unit tests.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	9155eeed10	test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests Add `ms` to tests which already test many format versions. The tests check that sstable files in newer versions are the same as in `mc`. Arbitrarily, for `ms`, we only check the files common between `mc` and `ms`. If we want to extend this test more, so that it checks that `Partitions.db` and `Rows.db` don't change over time, we have to add `ms` versions of all the sstables under `test/resources` which are used in this test. We won't do that in this patch series. And I'm not sure if we want to do that at all.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	d53f362328	test/boost/sstable_inexact_index_test: explicitly use a `me` sstable The test currently implicitly uses the default sstable format. But it assumes that the index reader type is `sstables::index_reader`, and it wants some methods specific to that type (and absent from the base `abstract_index_reader`). If we switch the default format from `me` to `ms`, without doing something about this, this test will start failing on the downcast to `sstables::index_reader`. We deal with this by explicitly specifying `me`. `me` and `ms` data readers are identical. And this is a test of the data reader, not the index reader. So it's perfectly fine to just use `me`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	fca56cb458	test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables This is an old test for some workaround for incorrectly-generated promoted indexes. It doesn't make sense to port this test to newer sstable formats. So just skip it for the new sstable versions.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	6143dce3db	test/boost/sstable_compaction_test: prepare for `ms` sstables. Fix incompatibilities between the test's assumptions and the upcoming addition of `ms` sstables. Refer to individual tests for comments.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	622149a183	test/boost/index_reader_test: prepare for `ms` sstables Resolve the incompatibilities between the test and the upcoming `ms` sstables. Refer to individual tests for comments.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	a67d10d15d	test/boost/bloom_filter_tests: prepare for `ms` sstables The test for the bloom filter rebuild mechanism has to be adjusted, because `ms` sstables won't use this mechanism.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	312423fe53	test/boost/sstable_datafile_test: prepare for `ms` sstables The tests touched in this commit are concerned specifically with Summary. They are not applicable to sstables with BTI indexes.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	924b8eec11	test/boost/sstable_test: prepare for `ms` sstables. Skip `ms` sstables in an uninteresting test which relies on `sstables::index_reader`.	2025-09-29 22:15:24 +02:00

4728 Commits