scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but less efficient because it requires reading the whole dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows an entire SSTable to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, incremental repair for vnodes is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired, so an SSTable is either fully repaired or not repaired at all, which is a huge simplification. This patch uses the repaired_at field from the sstables::statistics component to mark an SSTable as repaired. It uses a virtual clock as the repair timestamp, i.e., a monotonically increasing number for the repaired_at field of an SSTable and the sstables_repaired_at column in the system.tablets table. Notice that when an SSTable is not repaired, the repaired_at field is set to the default value 0. The being_repaired in-memory field of an SSTable is used to explicitly mark that an SSTable is being selected.
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
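The file-based selection described in this merge can be illustrated with a minimal sketch. It is not the ScyllaDB implementation: the `SSTable` class and the helper functions are hypothetical, and treating `repaired_at == 0` as "never repaired" before applying the `repaired_at <= sstables_repaired_at` rule is an assumption drawn from the message's note about the default value.

```python
from dataclasses import dataclass
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class SSTable:
    """Hypothetical stand-in for an SSTable's repair-related fields."""
    name: str
    repaired_at: int = 0                    # on-disk field; 0 is the default (unrepaired)
    being_repaired: Optional[UUID] = None   # in-memory marker holding the repair UUID


def is_repaired(sst: SSTable, sstables_repaired_at: int) -> bool:
    # Rule from the message: repaired_at <= sstables_repaired_at means repaired.
    # Assumption: the default value 0 is excluded explicitly.
    return sst.repaired_at != 0 and sst.repaired_at <= sstables_repaired_at


def select_for_repair(sstables: List[SSTable], sstables_repaired_at: int,
                      repair_id: UUID) -> List[SSTable]:
    """Pick only unrepaired SSTables and mark them as participating."""
    selected = [s for s in sstables if not is_repaired(s, sstables_repaired_at)]
    for s in selected:
        s.being_repaired = repair_id
    return selected


def finish_repair(selected: List[SSTable], sstables_repaired_at: int) -> int:
    """Advance the virtual clock by one and stamp the repaired SSTables."""
    new_timestamp = sstables_repaired_at + 1
    for s in selected:
        s.repaired_at = new_timestamp
        s.being_repaired = None
    return new_timestamp


if __name__ == "__main__":
    tables = [SSTable("a", repaired_at=1), SSTable("b")]
    chosen = select_for_repair(tables, 1, uuid4())
    print([s.name for s in chosen])
    print(finish_repair(chosen, 1))
```

Because whole SSTables are skipped, a second run over an unchanged dataset selects nothing, which is consistent with the near-constant repeat-run timings reported above.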
Michał Chojnowski	413dcf8891	sstables/trie: add BTI key translation routines This file provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block is added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later.	2025-08-15 11:13:00 +02:00
Michał Chojnowski	5e76708335	tests/lib: extract generate_all_strings to test/lib This util will be used in another test file in a later commit, so hoist it to `test/lib`.	2025-08-14 22:38:38 +02:00
Michał Chojnowski	3017dbb204	sstables/trie: add trie traversal routines `trie::node_reader`, added in a previous series, contains encoding-aware logic for traversing a single node (or a batch of nodes) during a trie search. This commit adds encoding-agnostic functions which drive the `trie::node_reader` in a loop to traverse the whole branch. Together, the added functions (`traverse`, `step`, `step_back`) and the data structure they modify (`ancestor_trail`) constitute a trie cursor. We might later wrap them into some `trie_cursor` class, but regardless of whether we are going to do that, keeping them (also) as free functions makes them easier to test. Closes scylladb/scylladb#25396	2025-08-11 19:15:09 +03:00
Asias He	1bf59ebba0	repair: Add incremental helpers This adds the helpers which are needed by both repair and compaction to add incremental repair support.	2025-08-11 10:10:08 +08:00
Avi Kivity	90eb6e6241	Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski This is the next part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25154 Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here. The new code added here is not used for anything yet, but it's posted as a separate PR to keep things reviewably small. This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes. It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR) into the on-disk format, and the logic for traversing the on-disk nodes during a read. New functionality, no backporting needed. Closes scylladb/scylladb#25317 * github.com:scylladb/scylladb: sstables/trie: add tests for BTI node serialization and traversal sstables/trie: implement BTI node traversal sstables/trie: implement BTI serialization utils/cached_file: add get_shared_page() utils/cached_file: replace a std::pair with a named struct	2025-08-07 12:15:42 +03:00
Michał Chojnowski	9930cd59eb	sstables/trie: add tests for BTI node serialization and traversal Adds tests which check that nodes serialized by `bti_node_sink` are readable by `bti_node_reader` with the right result. (Note: there are no tests which check compatibility of the encoded nodes with Cassandra or with handwritten hexdumps. There are only tests for mutual compatibility between Scylla's writers and readers. This can be considered a gap in testing.)	2025-08-05 21:48:24 +02:00
Avi Kivity	4c785b31c7	Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series also adds Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using, without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field. Unlike CQL, where logged in username, driver name, etc. apply to a complete connection, in the Alternator API different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator connections, we will list the currently active requests. The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch is a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing test of CQL's system.clients. Fixes #24993 This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches.
There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug in it can cause problems for every Alternator request. Closes scylladb/scylladb#25178 * github.com:scylladb/scylladb: test/cqlpy: slightly strengthen test for system.clients generic_server: use utils::scoped_item_list docs/alternator: document the system.clients system table in Alternator alternator: add test for Alternator clients in system.clients alternator: list active Alternator requests in system.clients utils: unit test for utils::scoped_item_list utils: add a scoped_item_list utility class utils: add "fatal" version of utils::on_internal_error()	2025-08-05 15:55:41 +03:00
Michał Chojnowski	85964094f6	sstables/trie: implement BTI node traversal This commit implements routines for traversal of BTI nodes in their on-disk format. The `node_reader` concept is currently unused (i.e. not asserted by any template). It will only be used in the next PR, which will implement trie cursor routines parametrized by `node_reader`. But I'm including it in this PR to make it clear which functions will be needed by the higher layer.	2025-08-05 00:56:50 +02:00
Michał Chojnowski	302adfb50d	sstables/trie: implement BTI serialization This commit introduces code responsible for serializing trie nodes (`writer_node`) into the on-disk BTI format, as described in: `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md`	2025-08-05 00:56:50 +02:00
Avi Kivity	8b1bf46086	Merge 'sstables: introduce trie_writer' from Michał Chojnowski This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization). Refs scylladb/scylladb#19191 New functionality, no backporting needed. Closes scylladb/scylladb#25154 * github.com:scylladb/scylladb: sstables: introduce trie_writer utils/bit_cast: add object_representation()	2025-08-01 20:23:24 +03:00
Nadav Har'El	20b31987e1	utils: unit test for utils::scoped_item_list The previous patch introduced a new utility class, utils::scoped_item_list. This patch adds a comprehensive unit test for the new class. We test basic usage of scoped_item_list, its size() and empty() methods, how items are removed from the list when their handle goes out of scope, how a handle's move constructor works, how items can be read and written through their handles, and finally that removing an item during a for_each_gently() iteration doesn't break the iteration. One thing I still didn't figure out how to properly test is how removing an item during multiple iterations that run concurrently fixes multiple iterators. I believe the code is correct there (we just have a list of ongoing iterations instead of just one), but I haven't yet found a way to reproduce this situation in a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Michał Chojnowski	c8682af418	sstables: introduce trie_writer This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).	2025-07-31 12:51:37 +02:00
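The cost comparison in the trie_writer message can be checked with a few lines of arithmetic. The two constants are the order-of-magnitude figures quoted in the message; the function names are hypothetical and only illustrate the character-per-node case the author wanted to avoid.

```python
# Order-of-magnitude figures quoted in the commit message.
NODE_COST_INSTRUCTIONS = 800       # processing one trie_writer node
INSERT_COST_INSTRUCTIONS = 40_000  # one small INSERT


def per_character_cost(new_chars: int) -> int:
    """Cost if every new key character became its own node."""
    return new_chars * NODE_COST_INSTRUCTIONS


def insert_equivalents(new_chars: int) -> float:
    """The same cost expressed as a number of small INSERTs."""
    return per_character_cost(new_chars) / INSERT_COST_INSTRUCTIONS


if __name__ == "__main__":
    for length in (100, 1000):
        print(length, per_character_cost(length), insert_equivalents(length))
```

For a 1000-character key this gives 800,000 instructions, i.e. the "20 INSERTs" mentioned in the message; a 100-character key already costs about 2 INSERTs.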
Calle Wilund	43f7eecf9e	compress: move compress.cc/hh to sstables/compressor Fixes #22106 Moves the shared compress components to sstables, and renames them to match the class type. Adjusts includes, removing redundant/unneeded ones where possible. Closes scylladb/scylladb#25103	2025-07-31 13:10:41 +03:00
Avi Kivity	1930f3e67f	Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space. However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position. For example, if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first index entry after key `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.) Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges. Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome. Preparation for new functionality, no backporting needed.
Closes scylladb/scylladb#25093 * github.com:scylladb/scylladb: sstables/index_reader: weaken some exactness guarantees in abstract_index_reader test/boost: add a test for inexact index lookups sstables/mx/reader: allow passing a custom index reader to the constructor sstables/index_reader: remove advance_to sstables/mx/reader: handle inexact lookups in `advance_context()` sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` sstables/index_reader: make the return value of `get_partition_key` optional sstables/mx/reader: handle "backward jumps" in forward_to sstables/mx/reader: filter out partitions outside the queried range sstables/mx/reader: update _pr after `fast_forward_to`	2025-07-27 19:39:36 +03:00
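The `a`, `b`, `c` example from this merge can be turned into a toy model of an inexact lookup. This is only a sketch over plain Python strings, not the BTI byte-comparable encoding or Scylla's reader; `conservative_start` and `read_from` are hypothetical names.

```python
from typing import List


def conservative_start(prefixes: List[str], lower: str) -> int:
    """Index of the first entry whose full key MIGHT be >= lower.

    An entry is certainly >= lower if its stored prefix already is.
    It is ambiguous if the stored prefix is a proper prefix of lower,
    because the hidden remainder of the key decides the comparison.
    Otherwise the entry is certainly below lower and can be skipped.
    """
    for i, prefix in enumerate(prefixes):
        if prefix >= lower or lower.startswith(prefix):
            return i
    return len(prefixes)


def read_from(keys: List[str], prefixes: List[str], lower: str) -> List[str]:
    """Data-reader side: start at the conservative position, then drop
    any superfluous partition that turns out to be below the bound."""
    start = conservative_start(prefixes, lower)
    return [key for key in keys[start:] if key >= lower]


if __name__ == "__main__":
    prefixes = ["a", "b", "c"]
    # Two sstables with the same index prefixes but different full keys.
    print(read_from(["ab", "ba", "cd"], prefixes, "bb"))
    print(read_from(["ab", "bc", "cd"], prefixes, "bb"))
```

In both cases the index returns the entry for `b`; only the full key read from Data reveals whether that partition belongs to the queried range, which is why the filtering step cannot be avoided.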
Michał Chojnowski	be1f54c6d2	test/boost: add a test for inexact index lookups	2025-07-25 11:00:18 +02:00
Botond Dénes	837424f7bb	Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping"). In more detail, this patch series consists of: * Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI). * The Azure host - the Key Vault endpoint bridge. * The Azure Key Provider - the interface for the Azure host. * Unit tests using real Azure resources (credentials and Vault keys). * Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens). This is part of the overall effort to support Azure deployments. Testing done: * Unit tests. * Manual test on an Azure VM with a Managed Identity. * Manual test with credentials from Azure CLI. * Manual test of `--azure-hosts` cmdline option. * Manual test of log filtering. Remaining items: - [x] Create necessary Azure resources for CI. - [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201). Closes https://github.com/scylladb/scylla-enterprise/issues/1077. New feature. No backport is needed. 
Closes scylladb/scylladb#23920 * github.com:scylladb/scylladb: docs: Document the Azure Key Provider test: Add tests for Azure Key Provider pylib: Add mock server for Azure Key Vault encryption: Define and enable Azure Key Provider encryption: azure: Delegate hosts to shard 0 encryption: Add Azure host cache encryption: Add config options for Azure hosts encryption: azure: Add override options encryption: azure: Add retries for transient errors encryption: azure: Implement init() encryption: azure: Implement get_key_by_id() encryption: azure: Add id-based key cache encryption: azure: Implement get_or_create_key() encryption: azure: Add credentials in Azure host encryption: azure: Add attribute-based key cache encryption: azure: Add skeleton for Azure host encryption: Templatize get_{kmip,kms,gcp}_host() encryption: gcp: Fix typo in docstring utils: azure: Get access token with default credentials utils: azure: Get access token from Azure CLI utils: azure: Get access token from IMDS utils: azure: Get access token with SP certificate utils: azure: Get access token with SP secret utils: rest: Add interface for request/response redaction logic utils: azure: Declare all Azure credential types utils: azure: Define interface for Azure credentials utils: Introduce base64url_{encode,decode}	2025-07-25 10:45:32 +03:00
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105, move the files and fix the other includes and the build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Nikos Dragazis	41b63469e1	encryption: Define and enable Azure Key Provider Define the Azure Key Provider to connect the core EaR business logic with the Azure-based Key Management implementation (Azure host). Introduce "AzureKeyProviderFactory" as a new `key_provider` value in the configuration. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	b39d1b195e	encryption: azure: Add skeleton for Azure host The Azure host manages cryptographic keys using Azure Key Vault. This patch only defines the API. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	0d0135dc4c	utils: azure: Declare all Azure credential types The goal is to mimic the Azure C++ SDK, which offers a variety of credentials, depending on their type and source. Declare the following credentials: * Service Principal credentials * Managed Identity credentials * Azure CLI credentials * Default credentials Also, define a common exception for SP and MI credentials which are network-based. This patch only defines the API. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	3c4face47b	utils: azure: Define interface for Azure credentials Azure authentication is token based: the client obtains an access token with their credentials, and uses it as a bearer token to authorize requests to Azure services. Define a common API for all credential types. The API will consist of a single `get_access_token()` function that returns a new or a cached access token for some resource URI (which defines the token scope). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
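The "new or cached access token" behaviour of `get_access_token()` could look roughly like the sketch below. It is not the ScyllaDB or Azure SDK code: `AccessToken`, `CachingCredentials`, the injected `fetch` callable, the per-resource cache and the expiry check are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AccessToken:
    value: str
    expires_at: float  # same time base as the clock passed to the credentials


class CachingCredentials:
    """Hypothetical credentials object with a single get_access_token() entry point."""

    def __init__(self, fetch: Callable[[str], AccessToken],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch    # performs the real token request (assumed)
        self._clock = clock
        self._cache: Dict[str, AccessToken] = {}

    def get_access_token(self, resource_uri: str) -> AccessToken:
        token = self._cache.get(resource_uri)
        # Request a new token only if none is cached or the cached one expired.
        if token is None or token.expires_at <= self._clock():
            token = self._fetch(resource_uri)
            self._cache[resource_uri] = token
        return token


if __name__ == "__main__":
    def fake_fetch(uri: str) -> AccessToken:
        return AccessToken(value=f"token-for-{uri}", expires_at=time.monotonic() + 60)

    creds = CachingCredentials(fake_fetch)
    first = creds.get_access_token("https://vault.azure.net")
    second = creds.get_access_token("https://vault.azure.net")
    print(first is second)
```

Keying the cache by resource URI follows the message's remark that the URI defines the token scope; a real implementation would presumably also refresh somewhat before expiry.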
Botond Dénes	26f135a55a	Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixure not generate TLS noise + remove boost::process' from Calle Wilund Fixes #24873 In the KMIP host, when we release a connection (socket) because our connection pool for the host is full, we currently don't close the connection properly and only rely on destructors. This just makes sure `release` closes the connection if it neither retains nor caches it. Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes Python SSL generate errors -> log noise that looks like actual errors. Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs. This also adds a fixture helper for processes, and moves the EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency. Closes scylladb/scylladb#24874 * github.com:scylladb/scylladb: encryption_test: Make PyKMIP run under seastar::experimental::process test/lib: Add wrapper helper for test process fixtures kmip_host: Close connections properly if dropped by pool being full encryption_at_rest_test: Do port check using TLS	2025-07-15 06:55:34 +03:00
Calle Wilund	722e2bce96	encryption_test: Make PyKMIP run under seastar::experimental::process Removes the requirement of boost::process, and all its non-seastar-ness. Hopefully also makes the IO and shutdown handling a bit more reliable.	2025-07-14 12:18:16 +00:00
Calle Wilund	253323bb64	test/lib: Add wrapper helper for test process fixtures Adds a wrapper for seastar::experimental::process, to help use external process fixtures in unit tests. Mainly to share concepts such as line reading of stdout/err etc, and to sync the shutdown of these. Also adds a small path searcher to find what you want to run.	2025-07-14 12:18:16 +00:00
Avi Kivity	6fce817aa8	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit(). Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Fixes https://github.com/scylladb/scylladb/issues/24531 Closes scylladb/scylladb#24886 [avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>] * github.com:scylladb/scylladb: test: add type creation to test_snapshot storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: 
introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-07-13 20:47:55 +03:00
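The phase ordering described in the merge message above can be sketched as follows. This is a minimal illustration, not ScyllaDB code: `RecordingSubsystem` and `apply_change` are hypothetical names, and only the phase names (`prepare`, `update`, `commit`, `post_commit`, plus a single database apply) come from the commit message. Calling `post_commit()` once per subsystem after all shards have committed is an assumption.

```python
# Illustrative sketch only: class and function names are hypothetical,
# not ScyllaDB's API. Only the phase names come from the commit message.

class RecordingSubsystem:
    """Stand-in for one raft-bound subsystem; it just records each phase."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def prepare(self):
        self.log.append(("prepare", self.name))

    def update(self):
        self.log.append(("update", self.name))

    def commit(self, shard):
        self.log.append(("commit", self.name, shard))

    def post_commit(self):
        self.log.append(("post_commit", self.name))


def apply_change(subsystems, shards, database_apply):
    """Run the phases in the order the commit message describes."""
    for s in subsystems:
        s.prepare()
    database_apply()  # a single apply for the whole batch
    for s in subsystems:
        s.update()
    for shard in shards:
        # On each shard, all commits run back to back; in the real system
        # this is where preemption must not happen.
        for s in subsystems:
            s.commit(shard)
    for s in subsystems:  # assumption: post_commit runs after every commit
        s.post_commit()
```

Because every `commit()` on a shard runs before anything else is scheduled there, a reader on that shard sees either the old state of all subsystems or the new state of all of them.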
Marcin Maliszkiewicz	19bc6ffcb0	replica: make truncate_table_on_all_shards get whole schema from table_shards Before for views and indexes it was fetching base schema from db (and couple other properties). This is a problem once we introduce atomic tables and views deletion (in the following commit). Because once we delete table it can no longer be fetched from db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via global_table_ptr (table_shards) object.	2025-07-10 10:40:43 +02:00
Pawel Pery	7bf53fc908	vector_store_client: implement initial vector_store_client service This patch is part of the vector_store_client sharded service implementation for communication with the vector-store service. It adds a `services/vector_store_client.{cc|hh}` sharded service and a configuration parameter `vector_store_uri` in the format `http://vector-store.dns.name:port`. If parsing that parameter fails, an exception is raised during construction. For future unit testing, the patch adds `vector_store_client_tester` as a way to inject mock functionality. This service will be used by the select statements for the Vector search indexes (see VS-46). For this reason I've added the vector_store_client service to the query processor. Reference: VS-47 VS-45	2025-07-08 16:29:55 +02:00
Yaniv Michael Kaul	82fba6b7c0	PowerPC: remove ppc stuff We don't even compile-test it. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#24659	2025-07-08 10:38:23 +03:00
Lakshmi Narayanan Sreethar	74c556a33d	types: introduce comparable_bytes class This patch implements a new class, `comparable_bytes`, designed to implement methods for converting data values to and from byte-comparable formats. The class stores the comparable bytes as `managed_bytes` and currently provides the structure for all required methods. The actual logic for converting various data types will be implemented in subsequent patches. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Avi Kivity	07c5edcc30	tools: add patchelf utility We use patchelf to rewrite the dynamic loader (known as the interpreter) of the binaries we ship, so we can point to our shipped dynamic loader, which is compatible with our binaries, rather than rely on the distribution's dynamic loader, which is likely to be incompatible. Upstream patchelf is losing compatibility [1] with Linux 5.17 and below. This change was also picked up by Fedora 42, so we cannot update the toolchain to that distribution until we have an alternative. Here we add a minimal patchelf alternative. It was mostly written by Claude. It is minimal in that it only supports --set-interpreter and --print-interpreter, and works well enough for our needs. We still use the original patchelf for --remove-rpath; this reduces our maintenance needs. [1] `43b75fbc9f` [2] `4b015255d1` Closes scylladb/scylladb#24695	2025-06-30 07:24:05 +03:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions. 
Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Kefu Chai	e212b1af0c	build: add p11-kit's cflags to user_cflags instead of args.user_cflags Fix an issue introduced in commit `083f7353` where p11-kit's compiler flags were incorrectly added to `args.user_cflags` instead of `user_cflags`. This created the following problem: When using CMake generation mode, these flags were added to `CMAKE_CXX_FLAGS`, causing them to be passed to all compiler invocations including linking stages where they were irrelevant. This change moves p11-kit's cflags to `user_cflags`, which ensures the flags are correctly included in compilation commands but not in linking commands. This maintains the proper behavior in the ninja build system while fixing the issue in the CMake build system. `args.user_cflags` is preserved for its intended purpose of storing user-specified compiler flags passed via command line options. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23988	2025-06-25 11:24:09 +03:00
Botond Dénes	3e1c50e9a7	db: introduce corrupt_data_handler Similar to large_data_handler, this interface allows sstable writers to delegate the handling of corrupt data. Two implementations are provided: * system_table_corrupt_data_handler - saves corrupt data in system.corrupt_data, with a TTL=10days (non-configurable for now) * nop_corrupt_data_handler - drops corrupt data	2025-06-24 14:57:00 +03:00
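The delegation pattern behind the handler interface above might look roughly like this. It is a sketch under assumptions: every class and function name is hypothetical, an in-memory list stands in for `system.corrupt_data`, and a key counts as "bad" when it does not have the full number of components. Only the two behaviours (save with a 10-day TTL, or drop) come from the commit message.

```python
from abc import ABC, abstractmethod

TEN_DAYS_SECONDS = 10 * 24 * 60 * 60  # TTL stated in the commit message


class CorruptDataHandler(ABC):
    """Hypothetical interface: decides what happens to a corrupt row."""

    @abstractmethod
    def record(self, row):
        ...


class TableCorruptDataHandler(CorruptDataHandler):
    """Keeps corrupt rows for later inspection (a list stands in for the table)."""

    def __init__(self):
        self.saved = []

    def record(self, row):
        self.saved.append({"row": row, "ttl": TEN_DAYS_SECONDS})


class NopCorruptDataHandler(CorruptDataHandler):
    """Drops corrupt rows."""

    def record(self, row):
        pass


def write_rows(rows, key_size, handler):
    """Write rows with a full key; hand every other row to the handler."""
    written = []
    for key, value in rows:
        if len(key) == key_size:
            written.append((key, value))
        else:
            handler.record((key, value))  # assumption: short key == corrupt
    return written
```

The writer never decides what happens to a bad row; swapping the handler changes the policy without touching the write path.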
Botond Dénes	b0d5462440	idl: extract full_position.idl from position_in_partition.idl A future user of position_in_partition.idl doesn't need full_position and so doesn't want to include full_position.hh to fix compile errors when including position_in_partition.idl.hh. Extract it to a separate idl file: it has a single user in a storage_proxy VERB.	2025-06-24 11:05:30 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Tomasz Grabiec	0b516da95b	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Closes scylladb/scylladb#20853 * github.com:scylladb/scylladb: storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions 
public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-06-10 13:45:32 +02:00
Calle Wilund	80feb8b676	utils::http::dns_connection_factory: Use a shared certificate_credentials Fixes #24447 This factory type, which is really more a data holder/connection producer per connection instance, creates, if using https, a new certificate_credentials on every instance. Which when used by S3 client is per client and scheduling groups. Which eventually means that we will do a set_system_trust + "cold" handshake for every tls connection created this way. This will cause both IO and cold/expensive certificate checking -> possible stalls/wasted CPU. Since the credentials object in question is literally a "just trust system", it could very well be shared across the shard. This PR adds a thread local static cached credentials object and uses this instead. Could consider moving this to seastar, but maybe this is too much. Closes scylladb/scylladb#24448	2025-06-10 11:20:21 +03:00
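The caching idea in the commit above, one credentials object per thread instead of one per connection factory, can be illustrated with a small sketch. The names `system_trust_credentials` and `build` are hypothetical, and Python's `threading.local` stands in for the thread-local static mentioned in the message.

```python
import threading

# One cache slot per thread; stands in for a thread-local static object.
_cache = threading.local()


def system_trust_credentials(build):
    """Return this thread's cached credentials, building them on first use.

    `build` is a hypothetical callable that performs the expensive setup
    (loading the system trust store).
    """
    creds = getattr(_cache, "creds", None)
    if creds is None:
        creds = build()
        _cache.creds = creds
    return creds
```

Every factory on the same thread then shares one object, so the expensive setup runs once per thread rather than once per factory instance.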
Marcin Maliszkiewicz	a27776b4ff	replica: make truncate_table_on_all_shards get whole schema from table_shards Before for views and indexes it was fetching base schema from db (and couple other properties). This is a problem once we introduce atomic tables and views deletion (in the following commit). Because once we delete table it can no longer be fetched from db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via global_table_ptr (table_shards) object.	2025-06-06 08:50:33 +02:00
Calle Wilund	942477ecd9	encryption/utils: Move encryption httpclient to "general" REST client Fixed #24296 While the HTTP client used for REST calls in AWS/GCP KMS integration (EAR) is not general enough to be called a HTTP client as such, it is general enough to be called a REST client (limited to stateless, single-op REST calls). Other code, like general auth integrations (hello Azure) and similar could reuse this to lessen code duplication. This patch simply moves the httpclient class from encryption to "rest" namespace, and explicitly "limits" it to such usage. Making an alias in encryption to avoid touching more files than needed. Closes scylladb/scylladb#24297	2025-05-30 12:21:51 +03:00
Michał Hudobski	3ab643a5de	vector_index: add custom index and vector index classes In this patch we add an abstract class, "custom_index", with a validate() method. Each CUSTOM INDEX class needs to implement a concrete subclass of custom_index which is used to validate if this type of custom index class may be used, and whether the optional parameters passed to it are valid. We change the existing CUSTOM INDEX validation code to use this new mechanism. Finally this patch implements one concrete subclass for vector index. Before this patch, the custom index type "vector_index" was allowed, but after this patch it gains more validation of its optional parameters (we support 4 specific parameters, with some rules on their values). Of course, the vector index isn't actually implemented in this patch, we are just improving the validation of the index creation statement.	2025-05-27 21:04:50 +02:00
Petr Gusev	0443081b0d	build: fix merge-compdb.py for CMake 'output' attributes compile_commands.json is used by LSPs (e.g. `clangd` in VS Code) for code navigation. `merge-compdb.py`, called by `configure.py`, merges these files from Scylla, Seastar, and Abseil. The script filters entries by checking the output attribute against a given prefix. This is needed because Scylla’s compile_commands.json is generated by Ninja and includes all build modes, in case the user specified multiple ones in the call to configure.py. Seastar and Abseil databases, generated by CMake, used to omit the output attribute, so filtering did not apply. Starting with `CMake 3.20+`, output attributes are now included and do not match the expected prefix. For example, they could be of the form `absl/synchronization/CMakeFiles/synchronization.dir/internal/futex_waiter.cc.o`. This causes relevant entries from Seastar and Abseil to be filtered out. This patch refactors `merge-compdb.py` to allow specifying an optional prefix per input file, preserving the intent of applying the output filtering logic only for ninja-generated Scylla compdb file. Closes scylladb/scylladb#24211	2025-05-20 08:43:09 +03:00
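The per-file prefix filtering described in the commit above can be sketched as follows. This is not the actual `merge-compdb.py`: the function name, the input shape, and the treatment of entries without an `output` attribute are assumptions made for illustration.

```python
def merge_compdbs(inputs):
    """Merge compile-command entries from several databases.

    `inputs` is a list of (entries, prefix) pairs. When `prefix` is None,
    every entry from that database is kept; otherwise only entries whose
    "output" starts with the prefix are kept.
    """
    merged = []
    for entries, prefix in inputs:
        for entry in entries:
            # Assumption: an entry without "output" does not match a prefix.
            if prefix is None or entry.get("output", "").startswith(prefix):
                merged.append(entry)
    return merged


# The Ninja-generated database covers all build modes, so it is filtered by
# prefix. The CMake-generated one has unrelated output paths and is taken whole.
scylla = [
    {"file": "main.cc", "output": "build/dev/main.o"},
    {"file": "main.cc", "output": "build/release/main.o"},
]
abseil = [
    {"file": "futex_waiter.cc",
     "output": "absl/synchronization/CMakeFiles/synchronization.dir/internal/futex_waiter.cc.o"},
]
result = merge_compdbs([(scylla, "build/dev/"), (abseil, None)])
```

With a single global prefix, the Abseil entry would be dropped as soon as CMake started emitting `output`. Making the prefix optional per input restores it.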
Nadav Har'El	248688473d	build: when compiling without -g, don't leave debugging information If Scylla is compiled without "-g" (this is, for example, the default in dev build mode), any static library that we link with it and contains any debugging information will cause the resulting executable to incorrectly look (e.g., to file(1) or to gdb) like it has debugging information. For more than three years now (see #10863 for historical context), the wasmtime.a library, which has debugging symbols, has caused this to happen. In this patch, if a certain build is compiled WITHOUT "-g", we add the "--strip-debug" option to the linker to remove the partial debugging information from the executable. Note that --strip-debug is not added in build modes which do use "-g", or if the user explicitly asked to add -g (e.g., "configure.py --cflags=-g"). Before this patch: $ file build/dev/scylla build/dev/scylla: ELF 64-bit LSB executable ... , with debug_info, not stripped After this patch: $ file build/dev/scylla build/dev/scylla: ELF 64-bit LSB executable ... , not stripped Fixes #23832. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23840	2025-05-12 15:42:17 +03:00
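The decision rule in the commit above reduces to a small predicate. The sketch below is an assumption about its shape, not the code in `configure.py`: the function and parameter names are hypothetical, and only a literal `-g` token in the user flags is recognised, not variants such as `-g3`.

```python
def extra_linker_options(mode_uses_g, user_cflags):
    """Return linker options to add for a build mode.

    `mode_uses_g` says whether the build mode itself compiles with -g.
    `user_cflags` is the string the user passed as extra compiler flags.
    """
    user_asked_for_g = "-g" in user_cflags.split()
    if mode_uses_g or user_asked_for_g:
        return []  # debug info is wanted, keep it
    # No -g anywhere: strip the partial debug info pulled in by static libs.
    return ["--strip-debug"]
```

For a dev-like mode without `-g` this returns `["--strip-debug"]`. Passing `--cflags=-g` or using a mode that already compiles with `-g` returns an empty list.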
Michał Chojnowski	f075674ebe	test: add test/boost/sstable_compressor_factory_test Add a basic test for NUMA awareness of `default_sstable_compressor_factory`.	2025-05-07 14:43:20 +02:00
Avi Kivity	5e1cf90a51	build: replace tools/java submodule with packaged cassandra-stress We no longer use tools/java (scylladb/scylla-tools-java.git) for nodetool or cqlsh; only cassandra-stress. Since that is available in package form install that and excise the tools/java submodule from the source tree. pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh submodule). A few jmx references are dropped as well. Frozen toolchain regenerated. Optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#23698	2025-04-15 10:11:28 +03:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. 
Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
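The majority arithmetic behind the merge message above can be checked with a few lines. The helper names are hypothetical; the numbers are the ones the message uses (20 nodes across three DCs, and a limit of 5 voters).

```python
def majority(voters):
    """Smallest number of voters that forms a quorum."""
    return voters // 2 + 1


def keeps_majority(voters, lost):
    """True if the group still has a quorum after `lost` voters disappear."""
    return voters - lost >= majority(voters)


# All nodes are voters: 2 DCs with 5 nodes plus 1 DC with 10 nodes.
all_voters = 5 + 5 + 10
large_dc_isolated = keeps_majority(all_voters, 10)

# Limited voters: with 5 voters, two independent failures are survivable.
five_lose_two = keeps_majority(5, 2)
five_lose_three = keeps_majority(5, 3)
four_lose_two = keeps_majority(4, 2)
three_lose_two = keeps_majority(3, 2)
```

Here `large_dc_isolated` is false, because 10 remaining voters fall short of the 11 needed. `five_lose_two` is true, while `five_lose_three`, `four_lose_two` and `three_lose_two` are false, which is why 5 is described as the smallest "safe" limit.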
Avi Kivity	ed3e4f33fd	Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency. It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed. New connections_shed and connections_blocked metrics are added for tracking. Testing: - manually via simple program creating a high number of connections and constantly re-connecting - added benchmark Following are benchmark results: Before: ``` > build/release/test/perf/perf_generic_server --smp=1 170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors) [...] throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54 instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96 cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15 ``` After: ``` > build/release/test/perf/perf_generic_server --smp=1 167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors) [...] throughput: mean= 171199.55 standard-deviation=2484.58 median= 171667.06 median-absolute-deviation=2087.63 maximum=173689.11 minimum=167904.76 instructions_per_op: mean= 4801.90 standard-deviation=16.54 median= 4796.78 median-absolute-deviation=9.32 maximum=4830.71 minimum=4789.81 cpu_cycles_per_op: mean= 3245.26 standard-deviation=32.28 median= 3230.44 median-absolute-deviation=16.52 maximum=3297.39 minimum=3215.62 ``` The patch adds around 67 insns/op so its effect on performance should be negligible. 
Fixes: https://github.com/scylladb/scylladb/issues/22844 Closes scylladb/scylladb#22828 * github.com:scylladb/scylladb: transport: move on_connection_close into connection destructor test: perf: make aggregated_perf_results formatting more human readable transport: add blocked and shed connection metrics generic_server: throttle and shed incoming connections according to semaphore limit generic_server: add data source and sink wrappers bookkeeping network IO generic_server: coroutinize part of server::do_accepts test: add benchmark for generic_server test: perf: add option to count multiple ops per time_parallel iteration generic_server: add semaphore for limiting new connections concurrency generic_server: add config to the constructor generic_server: add on_connection_ready handler	2025-04-09 21:41:38 +03:00
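The throttle-and-shed behaviour described in the merge message above can be sketched with a semaphore and a timeout. This is an asyncio illustration under assumptions, not the Seastar-based generic_server code: `ConnectionGate` and its members are hypothetical names, and the `blocked` and `shed` counters only mimic the idea of the two new metrics.

```python
import asyncio


class ConnectionGate:
    """Limit how many new connections are initialised concurrently.

    Create it inside a running event loop. Connections that wait longer
    than `shed_after` seconds for a slot are shed instead of served.
    """

    def __init__(self, concurrency, shed_after):
        self._sem = asyncio.Semaphore(concurrency)
        self._shed_after = shed_after
        self.blocked = 0  # connections that had to wait for a slot
        self.shed = 0     # connections dropped after waiting too long

    async def run(self, init):
        """Run the `init` coroutine function under the limit; False if shed."""
        if self._sem.locked():
            self.blocked += 1
        try:
            await asyncio.wait_for(self._sem.acquire(), self._shed_after)
        except asyncio.TimeoutError:
            self.shed += 1
            return False
        try:
            await init()
        finally:
            self._sem.release()
        return True


async def demo():
    gate = ConnectionGate(concurrency=1, shed_after=0.05)

    async def slow_init():
        await asyncio.sleep(0.3)

    async def fast_init():
        pass

    first = asyncio.create_task(gate.run(slow_init))
    await asyncio.sleep(0.01)           # let the first connection take the slot
    second = await gate.run(fast_init)  # waits 0.05s, then is shed
    return await first, second, gate.blocked, gate.shed
```

Running `asyncio.run(demo())` serves the first connection and sheds the second, because the only slot stays busy for longer than the shed timeout. The real limit in the commit is one minute.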
Marcin Maliszkiewicz	719d04d501	test: add benchmark for generic_server Changes in configure.py are needed because we don't want to embed this benchmark in the scylla binary the way perf_simple_query or perf_alternator are: it doesn't directly translate to Scylla performance, but we want to use aggregated_perf_results for precise cpu measurements, so we need different dependencies.	2025-04-09 10:48:42 +02:00
Robert Bindar	4e3eb2fdac	Move direct_failure_detector from root to service/ direct_failure_detector used to be used by gms/ as well, but that's not the case anymore, so raft/ is the only user. Fixes #23133 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23248	2025-04-08 13:03:24 +03:00
Emil Maskovsky	1d06ea3a5a	raft: implement the limited voters feature Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of datacenters (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose the majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. Currently the voter limits will not be configurable (we might introduce configurable limits later if that would be needed/requested). The feature is enabled by the `group0_limited_voters` feature flag to avoid issues with cluster upgrade (the feature will be only enabled once all nodes in the cluster are upgraded to the version supporting the feature). Fixes: scylladb/scylladb#18793	2025-04-07 12:31:18 +02:00
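The distribution rules in the commit above can be checked against the majority condition directly. The helper below is a hypothetical sketch, not the topology coordinator's algorithm: it takes a voter count per DC and reports which single-DC losses the group would survive.

```python
def majority(total):
    """Smallest number of voters that forms a quorum."""
    return total // 2 + 1


def survivable_dc_losses(voters_per_dc):
    """Return the DCs whose complete loss still leaves a voter majority."""
    total = sum(voters_per_dc.values())
    needed = majority(total)
    return sorted(dc for dc, voters in voters_per_dc.items()
                  if total - voters >= needed)


# 2 DCs with 2 nodes each: symmetric voters survive no DC loss at all,
# while an asymmetric 2 + 1 split survives the loss of the smaller DC.
symmetric = survivable_dc_losses({"dc1": 2, "dc2": 2})
asymmetric = survivable_dc_losses({"dc1": 2, "dc2": 1})

# 3 DCs: when every DC holds strictly less than half of the voters,
# any single DC can be lost.
balanced = survivable_dc_losses({"dc1": 2, "dc2": 2, "dc3": 1})
lopsided = survivable_dc_losses({"dc1": 3, "dc2": 1, "dc3": 1})
```

`symmetric` is empty and `asymmetric` contains only `dc2`, which matches the 2-DC example in the message. `balanced` lists all three DCs, while `lopsided` excludes `dc1` because it holds more than half of the five voters.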

2065 Commits