scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 04:06:59 +00:00

Author	SHA1	Message	Date
copilot-swe-agent[bot]	7dc72ff9a5	Fix unsigned integer underflow in item length calculation Co-authored-by: nyh <584227+nyh@users.noreply.github.com>	2025-08-10 11:46:43 +00:00
copilot-swe-agent[bot]	31c79f6cd7	Fix critical bugs in base64 implementation - array bounds and substr length Co-authored-by: nyh <584227+nyh@users.noreply.github.com>	2025-08-10 11:41:57 +00:00
copilot-swe-agent[bot]	5c7d63da6a	Fix critical bugs in alternator get_magnitude_and_precision function Co-authored-by: nyh <584227+nyh@users.noreply.github.com>	2025-08-10 11:38:22 +00:00
Avi Kivity	8164f72f6e	Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Refs #22733 * No backport required Closes scylladb/scylladb#25222 * github.com:scylladb/scylladb: locator: abstract_replication_strategy: implement local_replication_strategy locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently locator: abstract_replication_map: rename make_effective_replication_map locator: abstract_replication_map: rename calculate_effective_replication_map replica: database: keyspace: rename {create,update}_effective_replication_map locator: effective_replication_map_factory: rename create_effective_replication_map locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al locator: abstract_replication_strategy: rename global_vnode_effective_replication_map keyspace: rename get_vnode_effective_replication_map dht: range_streamer: use naked e_r_m pointers storage_service: use naked e_r_m pointers alternator: ttl: use naked e_r_m pointers locator: abstract_replication_strategy: define is_local	2025-08-07 12:51:43 +03:00
Avi Kivity	90eb6e6241	Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski This is the next part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25154 Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here. The new code added here is not used for anything yet, but it's posted as a separate PR to keep things reviewably small. This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes. It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR) into the on-disk format, and the logic for traversing the on-disk nodes during a read. New functionality, no backporting needed. Closes scylladb/scylladb#25317 * github.com:scylladb/scylladb: sstables/trie: add tests for BTI node serialization and traversal sstables/trie: implement BTI node traversal sstables/trie: implement BTI serialization utils/cached_file: add get_shared_page() utils/cached_file: replace a std::pair with a named struct	2025-08-07 12:15:42 +03:00
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could, since the plan is to get rid of it once all system keyspaces areconverted to local or tablets replication (and propagated everywhere if needed using raft group0) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	babb4a41a8	locator: abstract_replication_map: rename calculate_effective_replication_map to calculate_vnode_effective_replication_map since it is specific to vnode-based range calculations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Pavel Emelyanov	0616407be5	Merge 'rest_api: add endpoint which drops all quarantined sstables' from Taras Veretilnyk Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061 Backport is not required, it is new functionality Closes scylladb/scylladb#25063 * github.com:scylladb/scylladb: docs: Add documentation for the nodetool dropquarantinedsstables command nodetool: add command for dropping quarantine sstables rest_api: add endpoint which drops all quarantined sstables	2025-08-06 11:55:15 +03:00
Karol Nowacki	032e8f9030	test/boost/vector_store_client_test.cc: Fix flaky tests The vector_store_client_test was observed to be flaky, sometimes hanging while waiting for a response from HTTP server. Problem: The default load balancing algorithm (in Seastar's posix_server_socket_impl::accept) could route an incoming connection to a different shard than the one executing the test. Because the HTTP server is a non-sharded service running only on the test's originating shard, any connection submitted to another shard would never be handled, causing the test client to hang waiting for response. Solution: The patch resolves the issue by explicitly setting fixed cpu load balancing algorithm. This ensures that incoming connections are always handled on the same shard where the HTTP server is running. Closes scylladb/scylladb#25314	2025-08-06 11:24:51 +03:00
Michał Chojnowski	9930cd59eb	sstables/trie: add tests for BTI node serialization and traversal Adds tests which check that nodes serialized by `bti_node_sink` are readable by `bti_node_reader` with the right result. (Note: there are no tests which check compatibility of the encoded nodes with Cassandra or with handwritten hexdumps. There are only tests for mutual compatibility between Scylla's writers and readers. This can be considered a gap in testing.)	2025-08-05 21:48:24 +02:00
Pavel Emelyanov	10056a8c6d	Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm. To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does. Fixes: https://github.com/scylladb/scylladb/issues/25044 Should be backported to 2025.3 since we need this fix for the restore Closes scylladb/scylladb#24961 * github.com:scylladb/scylladb: s3_creds: code cleanup s3_creds: Make `reload` unconditional s3_creds: Add test exposing credentials renewal issue	2025-08-05 17:49:13 +03:00
Avi Kivity	4c785b31c7	Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series adds also Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using -without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field. Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator connections, we will list the currently active requests. The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients. Fixes #24993 This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug on it can cause problems for every Alternator request. Closes scylladb/scylladb#25178 * github.com:scylladb/scylladb: test/cqlpy: slightly strengthen test for system.clients generic_server: use utils::scoped_item_list docs/alternator: document the system.clients system table in Alternator alternator: add test for Alternator clients in system.clients alternator: list active Alternator requests in system.clients utils: unit test for utils::scoped_item_list utils: add a scoped_item_list utility class utils: add "fatal" version of utils::on_internal_error()	2025-08-05 15:55:41 +03:00
Ernest Zaslavsky	e4ebe6a309	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading.	2025-08-03 17:41:35 +03:00
Ernest Zaslavsky	68855c90ca	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred.	2025-08-03 17:41:25 +03:00
Avi Kivity	8b1bf46086	Merge 'sstables: introduce trie_writer' from Michał Chojnowski This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization). Refs scylladb/scylladb#19191 New functionality, no backporting needed. Closes scylladb/scylladb#25154 * github.com:scylladb/scylladb: sstables: introduce trie_writer utils/bit_cast: add object_representation()	2025-08-01 20:23:24 +03:00
Nikos Dragazis	2656fca504	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995	2025-08-01 12:11:27 +03:00
Nadav Har'El	20b31987e1	utils: unit test for utils::scoped_item_list The previous test introduced a new utility class, utils::scoped_item_list. This patch adds a comprehensive unit test for the new class. We test basic usage of scoped_item_list, its size() and empty() methods, how items are removed from the list when their handle goes out of scope, how a handle's move constructor works, how items can be read and written through their handles, and finally that removing an item during a for_each_gently() iteration doesn't break the iteration. One thing I still didn't figure out how to properly test is how removing an item during multiple iterations that run concurrently fixes multiple iterators. I believe the code is correct there (we just have a list of ongoing iterations - instead of just one), but haven't found yet a way to reproduce this situation in a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Michał Chojnowski	c8682af418	sstables: introduce trie_writer This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).	2025-07-31 12:51:37 +02:00
Calle Wilund	43f7eecf9e	compress: move compress.cc/hh to sstables/compressor Fixes #22106 Moves the shared compress components to sstables, and rename to match class type. Adjust includes, removing redundant/unneeded ones where possible. Closes scylladb/scylladb#25103	2025-07-31 13:10:41 +03:00
Pavel Emelyanov	34608450c5	Merge 'qos: don't populate effective service level cache until auth is migrated to raft' from Piotr Dulikowski Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). Closes scylladb/scylladb#25188 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-07-31 13:05:27 +03:00
Botond Dénes	2985c343ed	Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808 Closes scylladb/scylladb#25002 * github.com:scylladb/scylladb: repair: Avoid too many fragments in a single repair_row_on_wire repair: Change partition_key_and_mutation_fragments to use chunked_vector utils: Allow chunked_vector::erase to work with non-default-constructible type	2025-07-29 17:45:57 +03:00
Piotr Dulikowski	2bb800c004	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963	2025-07-29 11:37:37 +02:00
Asias He	266a518e4c	repair: Change partition_key_and_mutation_fragments to use chunked_vector With the change in "repair: Avoid too many fragments in a single repair_row_on_wire", the std::list<frozen_mutation_fragment> _mfs; in partition_key_and_mutation_fragments will not contain large number of fragments any more. Switch to use chunked_vector.	2025-07-29 13:43:17 +08:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Taras Veretilnyk	fa98239ed8	rest_api: add endpoint which drops all quarantined sstables Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061	2025-07-28 16:55:17 +02:00
Avi Kivity	1930f3e67f	Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space. However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position. For example, if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first index entry after key `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.) Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges. Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome. Preparation for new functionality, no backporting needed. Closes scylladb/scylladb#25093 * github.com:scylladb/scylladb: sstables/index_reader: weaken some exactness guarantees in abstract_index_reader test/boost: add a test for inexact index lookups sstables/mx/reader: allow passing a custom index reader to the constructor sstables/index_reader: remove advance_to sstables/mx/reader: handle inexact lookups in `advance_context()` sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` sstables/index_reader: make the return value of `get_partition_key` optional sstables/mx/reader: handle "backward jumps" in forward_to sstables/mx/reader: filter out partitions outside the queried range sstables/mx/reader: update _pr after `fast_forward_to`	2025-07-27 19:39:36 +03:00
Avi Kivity	8180cbcf48	Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. * minor optimization, no backport needed Closes scylladb/scylladb#24978 * github.com:scylladb/scylladb: tablets: prevent accidental copy of tablets_map locator: tablets: get rid of synchronous mutate_tablet_map	2025-07-27 16:48:27 +03:00
Michał Chojnowski	be1f54c6d2	test/boost: add a test for inexact index lookups	2025-07-25 11:00:18 +02:00
Michał Chojnowski	fe8ee34024	sstables/index_reader: remove advance_to `advance_to` is unused now, so remove it.	2025-07-25 11:00:18 +02:00
Botond Dénes	837424f7bb	Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping"). In more detail, this patch series consists of: * Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI). * The Azure host - the Key Vault endpoint bridge. * The Azure Key Provider - the interface for the Azure host. * Unit tests using real Azure resources (credentials and Vault keys). * Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens). This is part of the overall effort to support Azure deployments. Testing done: * Unit tests. * Manual test on an Azure VM with a Managed Identity. * Manual test with credentials from Azure CLI. * Manual test of `--azure-hosts` cmdline option. * Manual test of log filtering. Remaining items: - [x] Create necessary Azure resources for CI. - [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201). Closes https://github.com/scylladb/scylla-enterprise/issues/1077. New feature. No backport is needed. Closes scylladb/scylladb#23920 * github.com:scylladb/scylladb: docs: Document the Azure Key Provider test: Add tests for Azure Key Provider pylib: Add mock server for Azure Key Vault encryption: Define and enable Azure Key Provider encryption: azure: Delegate hosts to shard 0 encryption: Add Azure host cache encryption: Add config options for Azure hosts encryption: azure: Add override options encryption: azure: Add retries for transient errors encryption: azure: Implement init() encryption: azure: Implement get_key_by_id() encryption: azure: Add id-based key cache encryption: azure: Implement get_or_create_key() encryption: azure: Add credentials in Azure host encryption: azure: Add attribute-based key cache encryption: azure: Add skeleton for Azure host encryption: Templatize get_{kmip,kms,gcp}_host() encryption: gcp: Fix typo in docstring utils: azure: Get access token with default credentials utils: azure: Get access token from Azure CLI utils: azure: Get access token from IMDS utils: azure: Get access token with SP certificate utils: azure: Get access token with SP secret utils: rest: Add interface for request/response redaction logic utils: azure: Declare all Azure credential types utils: azure: Define interface for Azure credentials utils: Introduce base64url_{encode,decode}	2025-07-25 10:45:32 +03:00
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Calle Wilund	a86e8d73f2	encryption_at_rest_test: ensure proxy connection flushing Refs #24551 Drops background flush for proxy output stream (because test), and also ensures we do explicit flush + close on exception in write loop. Ensures we don't hide actual exceptions with asserts. Closes scylladb/scylladb#25146	2025-07-25 10:45:32 +03:00
Benny Halevy	fce6c4b41d	tablets: prevent accidental copy of tablets_map As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:07:26 +03:00
Benny Halevy	dee0d7ffbf	locator: tablets: get rid of synchronous mutate_tablet_map It is currently used only by tests that could very well do with mutate_tablet_map_async. This will simplify the following patch to prevent accidental copy of the tablet_map, provding explicit clone/clone_gently methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:03:02 +03:00
Pavel Emelyanov	8220974e76	code: Update callers generating feature service config Instead of requesting it from gms code, create it "by hand" with the help of get_disabled_features_from_db_config() method. This is how other services are configured by main/tools/testing code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:19:09 +03:00
Ernest Zaslavsky	408aa289fe	treewide: Move misc files to `utils` directory As requested in #22114, moved the files and fixed other includes and build system. Moved files: - interval.hh - Map_difference.hh Fixes: #22114 This is a cleanup, no need to backport Closes scylladb/scylladb#25095	2025-07-21 11:56:40 +03:00
Piotr Dulikowski	7fd97e6a93	Merge 'cdc: Forbid altering columns of CDC log tables directly' from Dawid Mędrek The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 Backport: we should backport the change to all affected branches to prevent the consequences that may affect the user. Closes scylladb/scylladb#25008 * github.com:scylladb/scylladb: cdc: Forbid altering columns of inactive CDC log table cdc: Forbid altering columns of CDC log tables directly	2025-07-21 09:31:00 +02:00
Avi Kivity	3dfdcf7d7a	Merge 'transport: remove throwing `protocol_exception` on connection start' from Dario Mirovic `protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future. This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future. There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it. Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance. In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test. Testing Build: `release` Test file: `test/cqlpy/test_protocol_exceptions.py` Test name: `test_protocol_version_mismatch` (modified for mass connection requests) Test arguments: ``` max_attempts=100'000 num_parallel=10 ``` Throwing `protocol_exception` results: ``` real=1:26.97 user=10:00.27 sys=2:34.55 cpu=867% real=1:26.95 user=9:57.10 sys=2:32.50 cpu=862% real=1:26.93 user=9:56.54 sys=2:35.59 cpu=865% real=1:26.96 user=9:54.95 sys=2:32.33 cpu=859% real=1:26.96 user=9:53.39 sys=2:33.58 cpu=859% real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862% # average ``` Returning `protocol_exception` as `result_with_exception` or an exceptional future: ``` real=1:18.46 user=9:12.21 sys=2:19.08 cpu=881% real=1:18.44 user=9:04.03 sys=2:17.91 cpu=869% real=1:18.47 user=9:12.94 sys=2:19.68 cpu=882% real=1:18.49 user=9:13.60 sys=2:19.88 cpu=883% real=1:18.48 user=9:11.76 sys=2:17.32 cpu=878% real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879% # average ``` This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567. Refs: #24567 This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting. Closes scylladb/scylladb#24738 * github.com:scylladb/scylladb: test/cqlpy: add cpp exception metric test conditions transport/server: replace protocol_exception throws with returns utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception transport/server: avoid exception-throw overhead in handle_error test/cqlpy: add protocol_exception tests	2025-07-20 17:42:30 +03:00
Dario Mirovic	9f4344a435	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567	2025-07-17 16:40:02 +02:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Michał Chojnowski	4e4a4b6622	sstables: add sstable::make_index_reader() and use where appropriate If we add multiple index implementations, users of index readers won't easily know which concrete index reader type is the right one to construct. We also don't want pieces of code to depend on functionality specific to certain concrete types, if that's not necessary. So instead of constructing the readers by themselves, they can use a helper function, which will return an abstract (virtual) index reader. This patch adds such a function, as a method of `sstable`.	2025-07-17 10:32:57 +02:00
Nikos Dragazis	09dcdebca3	test: Add tests for Azure Key Provider The tests cover a variety of scenarios, including: * Authentication with client secrets, client certificates, and IMDS. * Valid and invalid encryption options in the configuration and table schema. * Common error conditions such as insufficient permissions, non-existent keys and network errors. All tests run against a local mock server by default. A subset of the tests can also against real Azure services if properly configured. The tests that support real Azure services were kept to a minimum to cover only the most basic scenarios (success path and common error conditions). Running the tests with real resources requires parameterizing them with env vars: * ENABLE_AZURE_TEST - set to non-zero (1/true) to run Azure tests (enabled by default) * ENABLE_AZURE_TEST_REAL - set to non-zero (1/true) to run against real Azure services * AZURE_TENANT_ID - the tenant where the principals live * AZURE_USER_1_CLIENT_ID - the client ID of user1 * AZURE_USER_1_CLIENT_SECRET - the secret of user1 * AZURE_USER_1_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user1 * AZURE_USER_2_CLIENT_ID - the client ID of user2 * AZURE_USER_2_CLIENT_SECRET - the secret of user2 * AZURE_USER_2_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user2 * AZURE_KEY_NAME - set to <vault_name>/<keyname> User1 is assumed to have permissions to wrap/unwrap using the given key. User2 is assumed to not have permissions for these operations. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 23:06:01 +03:00
Dawid Mędrek	20d0050f4e	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643	2025-07-16 15:35:48 +02:00
Avi Kivity	c762425ea7	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start	2025-07-16 13:15:54 +03:00
Andrzej Jackowski	b20aa7b5eb	auth: require scheme as parameter for `generate_salt` This is a refactoring commit that changes the `generate_salt` function to require a password hashing scheme as a parameter. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:39 +02:00
Botond Dénes	a26b6a3865	Merge 'storage: add `make_data_or_index_source` to the storages' from Ernest Zaslavsky Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance * Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. * Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage` No backport needed since it enhances functionality which has not been released yet fixes: https://github.com/scylladb/scylladb/issues/22458 Closes scylladb/scylladb#23695 * github.com:scylladb/scylladb: sstables: Start using `make_data_or_index_source` in `sstable` sstables: refactor readers and sources to use coroutines sstables: coroutinize futurized readers sstables: add `make_data_or_index_source` to the `storage` encryption: refactor key retrieval encryption: add `encrypted_data_source` class	2025-07-15 13:32:13 +03:00
Ernest Zaslavsky	8d49bb8af2	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.	2025-07-15 10:10:23 +03:00
Botond Dénes	26f135a55a	Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixure not generate TLS noise + remove boost::process' from Calle Wilund Fixes #24873 In KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors. This just makes sure `release` closes the connection if it neither retains or caches it. Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that look like actual errors. Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs. This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency. Closes scylladb/scylladb#24874 * github.com:scylladb/scylladb: encryption_test: Make PyKMIP run under seastar::experimental::process test/lib: Add wrapper helper for test process fixtures kmip_host: Close connections properly if dropped by pool being full encryption_at_rest_test: Do port check using TLS	2025-07-15 06:55:34 +03:00

1 2 3 4 5 ...

4053 Commits