scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could be, since the plan is to get rid of it once all system keyspaces are converted to local or tablets replication (and propagated everywhere if needed using raft group0). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	8bde507232	locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently create_effective_replication_map need not know about the internals of vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	8d4ac97435	locator: abstract_replication_map: rename make_effective_replication_map to make_vnode_effective_replication_map_ptr since it is specific to vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	babb4a41a8	locator: abstract_replication_map: rename calculate_effective_replication_map to calculate_vnode_effective_replication_map since it is specific to vnode-based range calculations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	34b223f6f9	replica: database: keyspace: rename {create,update}_effective_replication_map to *_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	688bd4fd43	locator: effective_replication_map_factory: rename create_effective_replication_map to create_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et al. to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	2ab44e871b	locator: abstract_replication_strategy: rename global_vnode_effective_replication_map to global_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:49 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Benny Halevy	33f34c8c32	dht: range_streamer: use naked e_r_m pointers Prepare for following patch that will separate the local effective replication map from vnode_effective_replication_map. The caller is responsible to keep the effective_replication_map_ptr alive while in use by low-level async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Benny Halevy	d6d434b1c2	storage_service: use naked e_r_m pointers Prepare for following patch that will separate the local effective replication map from vnode_effective_replication_map. The caller is responsible to keep the effective_replication_map_ptr alive while in use by low-level async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Benny Halevy	59375e4751	alternator: ttl: use naked e_r_m pointers Prepare for following patch that will separate the local effective replication map from vnode_effective_replication_map. The caller is responsible to keep the effective_replication_map_ptr alive while in use by low-level async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Benny Halevy	ec85678de1	locator: abstract_replication_strategy: define is_local Prepare for specializing the local replication strategy, local effective replication map, et al. by defining an is_local() predicate, similar to uses_tablets(). Note that is_vnode_based() still applies to local replication strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Avi Kivity	bb922b2aa9	Merge 'truncate: change check for write during truncate into a log warning' from Ferenc Szili TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail. This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that: - all data written before TRUNCATE starts is deleted - none of the data after TRUNCATE completes is deleted Fixes: #25173 Fixes: #25013 Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1 Closes scylladb/scylladb#25174 * github.com:scylladb/scylladb: truncate: add test for truncate with concurrent writes truncate: change check for write during truncate into a log warning	2025-08-06 00:03:37 +03:00
Pavel Emelyanov	10056a8c6d	Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm. To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does. Fixes: https://github.com/scylladb/scylladb/issues/25044 Should be backported to 2025.3 since we need this fix for the restore Closes scylladb/scylladb#24961 * github.com:scylladb/scylladb: s3_creds: code cleanup s3_creds: Make `reload` unconditional s3_creds: Add test exposing credentials renewal issue	2025-08-05 17:49:13 +03:00
Michael Litvak	faebfdf006	test/cluster/test_tablets_colocation: fix flaky test When restarting the server in the test, wait for it to become ready before requesting tablet repair. Fixes scylladb/scylladb#25261 Closes scylladb/scylladb#25263	2025-08-05 15:36:03 +02:00
Avi Kivity	4c785b31c7	Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series adds also Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using -without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field. Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator connections, we will list the currently active requests. The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients. Fixes #24993 This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. 
There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug on it can cause problems for every Alternator request. Closes scylladb/scylladb#25178 * github.com:scylladb/scylladb: test/cqlpy: slightly strengthen test for system.clients generic_server: use utils::scoped_item_list docs/alternator: document the system.clients system table in Alternator alternator: add test for Alternator clients in system.clients alternator: list active Alternator requests in system.clients utils: unit test for utils::scoped_item_list utils: add a scoped_item_list utility class utils: add "fatal" version of utils::on_internal_error()	2025-08-05 15:55:41 +03:00
Ferenc Szili	33488ba943	truncate: add test for truncate with concurrent writes test_validate_truncate_with_concurrent_writes checks if truncate deletes all the data written before the truncate starts, and does not delete any data after truncate completes.	2025-08-05 13:54:14 +02:00
Pavel Emelyanov	5fcdf948d9	doc: Update system.clients schema with scheduling_group cell It was added by `9319d65971` (db/virtual_tables: add scheduling group column to system.clients) recently. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25294	2025-08-05 10:16:20 +03:00
Artsiom Mishuta	4b975668f6	tiering (test.py): introduce tiering labels Introduce tiering marks: 1 “unstable” - for unstable tests that will continue running every night and generate up-to-date statistics with failures without failing the “Main” verification path (scylla-ci, Next). 2 “nightly” - for tests that are quite old, stable, and test functionality that is unlikely to be changed or affected by other features, are partially covered in other tests, verify non-critical functionality, have not found any issues or regressions, take too long to run on every PR, and can be popped out from the CI run. Set 7 long tests (according to statistics in Elastic) as nightly (these 8 tests took 20% of the CI run, about 4 hours without parallelization) and 1 test as unstable (as an example of marker usage). Closes scylladb/scylladb#24974	2025-08-04 15:38:16 +03:00
Ferenc Szili	268ec72dc9	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names, the truncated_at timepoint, and the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013	2025-08-04 12:24:50 +02:00
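The check described in this commit can be illustrated with a minimal Python sketch. It is only a model of the logic, not ScyllaDB's C++ code: the function name `check_writes_during_truncate`, the logger name, and the representation of a replay position as a `(segment_id, offset)` tuple are all assumptions made for illustration.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("truncate_sketch")


def check_writes_during_truncate(keyspace, table, flushed_rp, discarded_rp):
    """Return True if writes appear to have raced with the truncate.

    Replay positions are modeled as (segment_id, offset) tuples, which
    compare lexicographically. This is an assumption of the sketch.
    """
    if discarded_rp > flushed_rp:
        # Log a warning instead of raising: an exception here could
        # escape into a caller (such as a state-machine applier) that
        # cannot recover from it.
        logger.warning(
            "writes during truncate of %s.%s: discarded sstable replay "
            "position %s is higher than flushed memtable position %s",
            keyspace, table, discarded_rp, flushed_rp,
        )
        return True
    return False
```

The design point is that the caller keeps running after the warning, whereas a thrown exception would propagate up to whoever invoked the truncate.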
Piotr Dulikowski	ec7832cc84	Merge 'Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Patryk Jędrzejczak The following steps are performed in sequence as part of the Raft-based recovery procedure: - set `recovery_leader` to the host ID of the recovery leader in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - perform a rolling restart (with the recovery leader being restarted first). These steps are not intuitive and more complicated than they could be. In this PR, we simplify these steps. From now on, we will be able to simply set `recovery_leader` on each node just before restarting it. Apart from making necessary changes in the code, we also update all tests of the Raft-based recovery procedure and the user-facing documentation. Fixes scylladb/scylladb#25015 The Raft-based procedure was added in 2025.2. This PR makes the procedure simpler and less error-prone, so it should be backported to 2025.2 and 2025.3. Closes scylladb/scylladb#25032 * github.com:scylladb/scylladb: docs: document the option to set recovery_leader later test: delay setting recovery_leader in the recovery procedure tests gossip: add recovery_leader to gossip_digest_syn db: system_keyspace: peers_table_read_fixup: remove rows with null host_id db/config, gms/gossiper: change recovery_leader to UUID db/config, utils: allow using UUID as a config option	2025-08-04 08:29:32 +02:00
Ernest Zaslavsky	837475ec6f	s3_creds: code cleanup Remove unnecessary code that is no longer used	2025-08-04 09:26:11 +03:00
Ernest Zaslavsky	e4ebe6a309	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading.	2025-08-03 17:41:35 +03:00
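A small Python model can show why a conditional `reload` leads to a renewal storm when the timer fires before expiration. Everything here is hypothetical (the `Provider` class, the six-hour lifetime, the `seconds_until_next_renewal` helper); only the one-hour renewal margin and the conditional-versus-unconditional distinction come from the commit messages.

```python
from dataclasses import dataclass

RENEW_MARGIN = 3600  # renew one hour before expiration, as described above


@dataclass
class Credentials:
    token: str
    expires_at: float


class Provider:
    """Toy credentials provider; the lifetime value is an assumption."""

    def __init__(self, lifetime=6 * 3600):
        self.lifetime = lifetime
        self.fetches = 0
        self.creds = None

    def _fetch(self, now):
        self.fetches += 1
        self.creds = Credentials(f"token-{self.fetches}", now + self.lifetime)
        return self.creds

    def reload_conditional(self, now):
        # Old behaviour: skip the fetch while the credentials are still valid.
        if self.creds is not None and now < self.creds.expires_at:
            return self.creds
        return self._fetch(now)

    def reload(self, now):
        # New behaviour: the caller asked for a refresh, so always fetch.
        return self._fetch(now)


def seconds_until_next_renewal(creds, now):
    """Delay the timer would wait before trying to renew again."""
    return max(0.0, creds.expires_at - RENEW_MARGIN - now)
```

With the conditional variant, a renewal attempted at `expires_at - RENEW_MARGIN` returns the same credentials, so the next delay is zero and the timer retries immediately. The unconditional variant pushes the expiration forward and the delay becomes positive again.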
Ernest Zaslavsky	68855c90ca	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred.	2025-08-03 17:41:25 +03:00
Avi Kivity	1c25aa891b	Merge 'storage_proxy.cc: get_cas_shard: fallback to the primary replica shard' from Petr Gusev Currently, `get_cas_shard` uses `sharder.shard_for_reads` to decide which shard to use for LWT execution—both on replicas and the coordinator. If the coordinator is not a replica, `shard_for_reads` returns a default shard (shard 0). There are at least two problems with this: * shard 0 can become overloaded, because all LWT coordinators-but-not-replicas are served on it. * mismatch with replicas: the default shard doesn't match what `shard_for_reads` returns on replicas. This hinders the "same shard for client and server" RPC level optimization. In this PR we change `get_cas_shard` to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (`paxos::paxos_state::get_cas_lock`). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional `smp::submit_to` will be needed on server side. backport: not needed, since this fix applies only to LWT over tablets, and this feature is not released yet Closes scylladb/scylladb#25224 * github.com:scylladb/scylladb: test_tablets_lwt.py: make tests rf_rack_valid test_tablets_lwt: add test_lwt_coordinator_shard storage_proxy.cc: get_cas_shard: fallback to the primary replica shard sharder: add try_get_shard_for_reads method	2025-08-01 23:07:25 +03:00
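The fallback described above can be sketched in Python. The function names mirror `try_get_shard_for_reads` and `get_cas_shard` from the commit titles, but their signatures here are invented for illustration: the real C++ functions take different arguments, and treating the first list entry as the primary replica is an assumption of this sketch.

```python
from typing import Optional, Sequence, Tuple

# (host id, shard) pair for one tablet replica; a hypothetical representation.
Replica = Tuple[str, int]


def try_get_shard_for_reads(replicas: Sequence[Replica],
                            this_host: str) -> Optional[int]:
    """Return this host's shard for the tablet, or None if it is not a replica."""
    for host, shard in replicas:
        if host == this_host:
            return shard
    return None


def get_cas_shard(replicas: Sequence[Replica], this_host: str) -> int:
    """Pick the shard on which to run the LWT coordinator."""
    shard = try_get_shard_for_reads(replicas, this_host)
    if shard is not None:
        return shard
    # Not a replica: fall back to the primary replica's shard rather than
    # a fixed default such as shard 0. Assumes replicas[0] is the primary.
    return replicas[0][1]
```

Because the fallback depends only on the tablet's replica set, every non-replica coordinator for the same tablet picks the same shard number. The sketch does not model nodes with differing shard counts.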
Avi Kivity	8b1bf46086	Merge 'sstables: introduce trie_writer' from Michał Chojnowski This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization). Refs scylladb/scylladb#19191 New functionality, no backporting needed. Closes scylladb/scylladb#25154 * github.com:scylladb/scylladb: sstables: introduce trie_writer utils/bit_cast: add object_representation()	2025-08-01 20:23:24 +03:00
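The cost comparison in this commit message is simple arithmetic, spelled out below as a short Python sketch. The constants are the order-of-magnitude figures quoted in the message (about 800 instructions per trie node, about 40000 per small INSERT). The function names are made up for this example, and the sketch assumes the character-based design pays the per-node cost once per new character.

```python
# Order-of-magnitude figures quoted in the commit message.
NODE_COST = 800        # instructions to process one trie node
INSERT_COST = 40_000   # instructions for one small INSERT


def character_based_cost(new_chars: int) -> int:
    """Instructions spent if every new character becomes its own node."""
    return new_chars * NODE_COST


def cost_in_inserts(instructions: int) -> float:
    """Express an instruction count as a multiple of one small INSERT."""
    return instructions / INSERT_COST
```

For a 1000-character clustering key this gives 800000 instructions, i.e. the cost of 20 small INSERTs, which matches the figure in the message. A 100-character key comes out at 2 INSERTs.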
Andrei Chekun	c0d652a973	test.py: change boost test stdout to use filehandler instead of pipe With the current implementation, if pytest is killed, it is not able to write the stdout from the boost test. With the new approach, the output should be updated while the test is executing, instead of being written at the end of the test. Closes scylladb/scylladb#25260	2025-08-01 15:05:00 +03:00
Michał Jadwiszczak	10214e13bd	storage_service, group0_state_machine: move SL cache update from `topology_state_load()` to `load_snapshot()` Currently the service levels cache is unnecessarily updated in every call of `topology_state_load()`. But it is enough to reload it only when a snapshot is loaded. (The cache is also already updated when there is a change to one of `service_levels_v2`, `role_members`, `role_attributes` tables.) Fixes scylladb/scylladb#25114 Fixes scylladb/scylladb#23065 Closes scylladb/scylladb#25116	2025-08-01 13:41:08 +02:00
Nikos Dragazis	2656fca504	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995	2025-08-01 12:11:27 +03:00
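As an illustration of the underlying idea, the following sketch uses Python's standard `sqlite3` module directly. It does not show how the PyKMIP server is configured (that is outside what the commit message states); the table names and columns are invented, and only the count of 16 tables comes from the message.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection, n_tables: int = 16) -> None:
    """Create n_tables placeholder tables (names are hypothetical)."""
    for i in range(n_tables):
        conn.execute(
            f"CREATE TABLE t{i} (id INTEGER PRIMARY KEY, payload BLOB)"
        )
    conn.commit()


def table_count(conn: sqlite3.Connection) -> int:
    """Count the user tables recorded in the schema catalog."""
    row = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()
    return row[0]


# ":memory:" keeps the whole database in RAM, so schema creation does not
# wait on file syncs the way an on-disk database can.
conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The trade-off is that nothing is persisted once the connection closes, which is acceptable for a test server that is recreated for each run.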
Nadav Har'El	2431f92967	alternator, test: add reproducer for issue about immediate LWT timeout This patch adds a reproducer for issue #16261, where it was reported that when Alternator read-modify-write (using LWT) operations to the same partition are sent to different nodes, sometimes the operation fails immediately, with an InternalServerError claiming to be a "timeout", although this happens almost immediately (after a few milliseconds), not after any real timeout. The test uses 3 nodes, and 3 threads which send RMW operations to different items in the same partition, and usually (though not with 100% certainty) it reaches the InternalServerError in around 100 writes by each thread. This InternalServerError looks like: Internal server error: exceptions::mutation_write_timeout_exception (Operation timed out for alternator_alternator_Test_1719157066704.alternator_Test_1719157066704 - received only 1 responses from 2 CL=LOCAL_SERIAL.) The test also prints how much time it took for the request to fail, for example: In incrementing 1,0 on node 1: error after 0.017074108123779297 This is 0.017 seconds - it's not the cas_contention_timeout_in_ms timeout (1 second) or any other timeout. If we enable trace logging, adding to topology_experimental_raft/suite.yaml extra_scylla_cmdline_options: ["--logger-log-level", "paxos=trace"] we get the following TRACE-level message in the log: paxos - CAS[0] accept_proposal: proposal is partially rejected This again shows the problem is "uncertainty" (partial rejection) and not a timeout. Refs #16261 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#19445	2025-08-01 11:58:52 +03:00
Aleksandra Martyniuk	e607ef10cd	api: storage_service: do not log the exception that is passed to user The exceptions that are thrown by the tasks started with API are propagated to users. Hence, there is no need to log it. Remove the logs about exception in user started tasks. Fixes: https://github.com/scylladb/scylladb/issues/16732. Closes scylladb/scylladb#25153	2025-08-01 09:49:51 +03:00
Nadav Har'El	edc15a3cf5	test/cqlpy: slightly strengthen test for system.clients We already have a rather rudimentary test for system.clients listing CQL connections. However, as written the test will pass if system.clients is empty :-) So let's strengthen the test to verify that there must be at least one CQL connection listed in system.clients. Indeed, the test runs the "SELECT FROM system.clients" over one CQL connection, so surely that connection must be present. This patch doesn't strengthen this test in any other way - it still has just one connection, not many, it still doesn't validate the values of most of the columns, and it is still written to assume the Scylla server is running on localhost and not running any other workload in parallel. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:32:19 +03:00
Nadav Har'El	ce0ee27422	generic_server: use utils::scoped_item_list A previous patch introduced utils::scoped_item_list, which maintains a list of items - such as a list of ongoing connections - automatically removing the item from the list when its handle is destroyed. The list can also be iterated "gently" (without risking stalls when the list is long). The implementation of this class was based on very similar code in generic_server.hh / generic_server.cc. So in this patch we change generic_server to use the new scoped_item_list, and drop its own copy of the duplicated logic of maintaining the list and iterating gently over it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:32:14 +03:00
Nadav Har'El	70c94ac9dd	docs/alternator: document the system.clients system table in Alternator Add to docs/alternator/new-apis.md a full description of the `system.clients` support in Alternator that was added in the previous patches. Although arguably all Scylla system tables should work on Alternator and do not need to be individually documented, I believe that this specific table is interesting to document. This is because some of the attributes in this table have non-obvious and Alternator-specific meanings. Moreover, there's even a difference in what each individual item in the table represents (it represents active requests, not entire connections as in CQL). While editing the system tables section of new-apis.md, this patch also slightly improves its formatting. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	5baa4c40fd	alternator: add test for Alternator clients in system.clients This patch adds a regression test for the feature added in the previous patch, i.e that the system.clients virtual table also lists ongoing Alternator request. The new test reads the system.clients system table using an Alternator Scan request, so it should see its own request - at least - in the result. It verifies that it sees Alternator requests (at least one), and that these requests have the expected fields set, and for a couple of fields, we even know which value to expect (the "client_type" field is "alternator", and the "ssl_enabled" field depends on whether the test is checking an http:// or https:// URL (you can try both in test/alternator/run - by using or not using the "--https" parameter). The new test fails before the previous patch (because system.clients will not list any Alternator connection), and passes after it. As all tests in test_system_tables.py for Scylla-specific system tables, this test is marked scylla_only and skipped when running on AWS DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	c14b9c5812	alternator: list active Alternator requests in system.clients Today, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. In this patch we make Alternator active clients also be listed on this virtual table. Unlike CQL where logged in username applies to a complete connection, in the Alternator API, different requests, theoretically signed by different users, can arrive over the same HTTP connection. So instead of listing the currently open connections, we list the currently active requests. This means that when scanning system.clients, you will only see requests which are being handled right now - and not inactive HTTP connections. I think this good enough (besides being the correct thing to do) - one of the goals of this system.clients is to be able to see what kind of drivers are being used by the user (the "driver_name" field in the system.clients) - on a busy server there will always be some (even many) requests being handled, so we'll always have plenty of requests to see in system.clients. By the way, note that for Alternator requests, what we use for the "driver_name" is the request's User-Agent header. AWS SDKs typically write the driver's name, its version, and often a lot of other information in that header. For example, Boto3 sends a User-Agent looking like: Boto3/1.38.46 md/Botocore#1.38.46 md/awscrt#0.24.2 ua/2.1 os/linux#6.15.4-100.fc41.x86_64 md/arch#x86_64 lang/python#3.13.5 md/pyimpl#CPython m/N,P,b,D,Z cfg/retry-mode#legacy Botocore/1.38.46 Resource A functional test for the new feature - adding Alternator requests to the system.clients table - will be in the next patch. Fixes #24993 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	20b31987e1	utils: unit test for utils::scoped_item_list The previous patch introduced a new utility class, utils::scoped_item_list. This patch adds a comprehensive unit test for the new class. We test basic usage of scoped_item_list, its size() and empty() methods, how items are removed from the list when their handle goes out of scope, how a handle's move constructor works, how items can be read and written through their handles, and finally that removing an item during a for_each_gently() iteration doesn't break the iteration. One thing I still didn't figure out how to properly test is how removing an item during multiple iterations that run concurrently fixes multiple iterators. I believe the code is correct there (we just have a list of ongoing iterations - instead of just one), but haven't found yet a way to reproduce this situation in a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Nadav Har'El	186e6d3ce0	utils: add a scoped_item_list utility class In a later patch, we'll want Alternator to maintain a list of ongoing requests, and be able to list them when the system.clients table is read. This patch introduces a new container, utils::scoped_item_list<T>, that will help Alternator do that: 1. Each request adds an item to the list, and receives a handle; When that handle goes out of scope the item is automatically deleted from the list. 2. Also a method is provided for iterating over the list of items without risking a stall if the list is very long. The new scoped_item_list<T> is heavily based on similar code that is integrated inside generic_server.hh, which is used by CQL to similarly maintain a list of active connections and their properties. However, unfortunately that code is deeply integrated into the generic_server class, and Alternator can't use generic_server because it uses Seastar's HTTP server which isn't based on generic_server. In contrast, the container defined in this patch is stand-alone and does not depend on Alternator in any way. In a later patch in this series we will modify generic_server to use the new scoped_item_list<> instead of having that feature inside it. The next patch is a unit test for the new class we are adding in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
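The two properties listed in this commit (handle-scoped removal and stall-free iteration) can be approximated in Python, as a loose analogue only. The real utils::scoped_item_list<T> is a C++ class relying on destructors and Seastar. Here a context manager stands in for scope exit and `asyncio.sleep(0)` for yielding, and the class and method names `ScopedItemList`, `Handle` and `emplace` are assumptions of this sketch.

```python
import asyncio


class Handle:
    """Owns one list entry; closing the handle removes the entry."""

    def __init__(self, owner, key):
        self._owner = owner
        self._key = key

    def close(self):
        if self._owner is not None:
            self._owner._items.pop(self._key, None)
            self._owner = None  # make close() idempotent

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


class ScopedItemList:
    def __init__(self):
        self._items = {}
        self._next_key = 0

    def emplace(self, value):
        """Add an item and return the handle that controls its lifetime."""
        key = self._next_key
        self._next_key += 1
        self._items[key] = value
        return Handle(self, key)

    def __len__(self):
        return len(self._items)

    async def for_each_gently(self, fn):
        # Iterate over a snapshot of keys and skip entries removed while
        # we yielded. This differs from the real class, which per the
        # commit messages tracks ongoing iterations instead.
        for key in list(self._items):
            if key in self._items:
                fn(self._items[key])
            await asyncio.sleep(0)  # give other tasks a chance to run
```

A request handler would hold the handle for the duration of the request, for example `with requests.emplace(info): ...`, so the entry disappears as soon as the request finishes.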
Nadav Har'El	33476c7b06	utils: add "fatal" version of utils::on_internal_error() utils::on_internal_error() is a wrapper for Seastar's on_internal_error() which does not require a logger parameter - because it always uses one logger ("on_internal_error"). Not needing a unique logger is especially important when using on_internal_error() in a header file, where we can't define a logger. Seastar also has another similar function, on_fatal_internal_error(), for which we forgot to implement a "utils" version (without a logger parameter). This patch fixes that oversight. In the next patch, we need to use on_fatal_internal_error() in a header file, so the "utils" version will be useful. We will need the fatal version because we will encounter an unexpected situation during server destruction, and if we let the regular on_internal_error() just throw an exception, we'll be left in an undefined state. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Patryk Jędrzejczak	e53dc7ca86	Merge 'remove unused function and simplify some qp code.' from Gleb Natapov No backport needed since these are cleanups. Closes scylladb/scylladb#25258 * https://github.com/scylladb/scylladb: qp: fold prepare_one function into its only caller qp: co-routinize prepare_one function cql3: drop unused function	2025-07-31 18:19:47 +02:00
Taras Veretilnyk	1d6808aec4	topology_coordinator: Make tablet_load_stats_refresh_interval configurable This commit introduces a config option 'tablet_load_stats_refresh_interval_in_seconds' that allows overriding the default value without using error injection. Fixes scylladb/scylladb#24641 Closes scylladb/scylladb#24746	2025-07-31 14:31:55 +03:00
Gleb Natapov	041011b2ee	qp: fold prepare_one function into its only caller	2025-07-31 14:12:34 +03:00
Gleb Natapov	715f1d994f	qp: co-routinize prepare_one function	2025-07-31 14:11:17 +03:00
Michał Chojnowski	c8682af418	sstables: introduce trie_writer This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduce it separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).	2025-07-31 12:51:37 +02:00
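The cost comparison in the commit above follows from its own order-of-magnitude figures, assuming (as the message describes for the character-based design) one trie node per new character at roughly 800 instructions per node:

```latex
\[
\underbrace{1000}_{\text{new characters}} \times \underbrace{800}_{\text{instructions per node}}
  = 800\,000 \ \text{instructions}
  = 20 \times \underbrace{40\,000}_{\text{one small INSERT}}
\]
\[
100 \times 800 = 80\,000 \ \text{instructions}
  = 2 \times 40\,000
  = 8 \times \underbrace{10\,000}_{\text{one tiny partition in upgradesstables}}
\]
```

These are rough estimates derived from the quoted figures, not measurements.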
Calle Wilund	43f7eecf9e	compress: move compress.cc/hh to sstables/compressor Fixes #22106 Moves the shared compress components to sstables, and rename to match class type. Adjust includes, removing redundant/unneeded ones where possible. Closes scylladb/scylladb#25103	2025-07-31 13:10:41 +03:00
Pavel Emelyanov	34608450c5	Merge 'qos: don't populate effective service level cache until auth is migrated to raft' from Piotr Dulikowski Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). Closes scylladb/scylladb#25188 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-07-31 13:05:27 +03:00
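The gating described in the merge above can be sketched as a single predicate consulted by both the cache population path and the getter. Only the method name `can_use_effective_service_level_cache` comes from the commit message; the types, fields and the `maybe_populate_cache()` helper are hypothetical, and the sketch assumes the check covers both the service level and auth versions.

```cpp
#include <cassert>

// Hypothetical two-state migration marker for a subsystem.
enum class schema_version_sketch { legacy, raft };

// Hypothetical stand-in for the service level accessor.
struct service_level_accessor_sketch {
    schema_version_sketch service_levels_version;
    schema_version_sketch auth_version;

    // The cache is usable only once both service levels and auth live in
    // raft, so a reload never has to query the legacy system_auth tables.
    bool can_use_effective_service_level_cache() const {
        return service_levels_version == schema_version_sketch::raft
            && auth_version == schema_version_sketch::raft;
    }
};

// Hypothetical cache population entry point: it does nothing until the
// accessor reports that the cache may be used.
inline bool maybe_populate_cache(const service_level_accessor_sketch& accessor) {
    if (!accessor.can_use_effective_service_level_cache()) {
        return false; // skipped: auth is not migrated yet
    }
    // ... fetch roles and recompute effective service levels here ...
    return true;
}
```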
Botond Dénes	7e27157664	replica/table: add_sstables_and_update_cache(): remove error log The plural overload of this method logs an error when the sstable add fails. This is unnecessary, the caller is expected to catch and handle exceptions. Furthermore, this unconditional error log results in sporadic test failures, due to the unexpected error in the logs on shutdown. Fixes: #24850 Closes scylladb/scylladb#25235	2025-07-31 12:34:40 +03:00
Petr Gusev	3500a10197	scylla_cluster.py: add try_get_host_id Tests sometimes fail in ScyllaCluster.add_server on the 'replaced_srv.host_id' line because host_id is not resolved yet. In this commit we introduce functions try_get_host_id and get_host_id that resolve it when needed. Closes scylladb/scylladb#25177	2025-07-31 10:37:06 +02:00
Patryk Jędrzejczak	c41f0e6da9	Merge 'generic server: 2 step shutdown' from Sergey Zolotukhin This PR implements solution proposed in scylladb/scylladb#24481 Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections. The updated shutdown process is as follows: 1. Initial Shutdown Phase * Close the accept gate to block new incoming connections. * Abort all accept() calls. * For all active connections: * Close only the input side of the connection to prevent new requests. * Keep the output side open to allow responses to be sent. 2. Drain Phase * Wait for all in-progress requests to either complete or fail. 3. Final Shutdown Phase * Fully close all connections. Fixes scylladb/scylladb#24481 Closes scylladb/scylladb#24499 * https://github.com/scylladb/scylladb: test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. generic_server: Two-step connection shutdown. transport: consmetic change, remove extra blanks. transport: Handle sleep aborted exception in sleep_until_timeout_passes generic_server: replace empty destructor with `= default` generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. test: Add test for query execution during CQL server shutdown	2025-07-31 10:32:30 +02:00
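The three shutdown phases listed in the merge above can be sketched as a small synchronous state machine. This is an illustration only: the real code is asynchronous Seastar code, and apart from `shutdown_input` and `shutdown_output`, which the commit list mentions on the `connection` class, every name here is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical connection model: two independently closable sides
// plus a count of requests that are still being processed.
struct connection_sketch {
    bool input_open = true;
    bool output_open = true;
    int in_flight = 0;

    void shutdown_input() { input_open = false; }
    void shutdown_output() { output_open = false; }

    // A new request is accepted only while the input side is open.
    bool try_start_request() {
        if (!input_open) {
            return false;
        }
        ++in_flight;
        return true;
    }
    void finish_request() { --in_flight; }
};

struct server_sketch {
    bool accepting = true;
    std::vector<connection_sketch> connections;

    // Phase 1: stop accepting connections and close only the input side,
    // so responses to in-progress requests can still be sent.
    void begin_shutdown() {
        accepting = false;
        for (auto& c : connections) {
            c.shutdown_input();
        }
    }

    // Phase 2 condition: every in-progress request completed or failed.
    bool drained() const {
        for (const auto& c : connections) {
            if (c.in_flight > 0) {
                return false;
            }
        }
        return true;
    }

    // Phase 3: fully close connections, but only once drained.
    bool finish_shutdown() {
        if (!drained()) {
            return false;
        }
        for (auto& c : connections) {
            c.shutdown_output();
        }
        return true;
    }
};
```

In the real server the drain phase is a wait rather than a poll; the boolean return value here only stands in for "not finished yet".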

48781 Commits