scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 02:20:37 +00:00

Author	SHA1	Message	Date
copilot-swe-agent[bot]	7dc72ff9a5	Fix unsigned integer underflow in item length calculation Co-authored-by: nyh <584227+nyh@users.noreply.github.com>	2025-08-10 11:46:43 +00:00
copilot-swe-agent[bot]	31c79f6cd7	Fix critical bugs in base64 implementation - array bounds and substr length Co-authored-by: nyh <584227+nyh@users.noreply.github.com>	2025-08-10 11:41:57 +00:00
copilot-swe-agent[bot]	5c7d63da6a	Fix critical bugs in alternator get_magnitude_and_precision function Co-authored-by: nyh <584227+nyh@users.noreply.github.com>	2025-08-10 11:38:22 +00:00
Botond Dénes	70aa81990b	Merge 'Alternator - add the ability to write, not just read, system tables' from Nadav Har'El In commit `44a1daf` we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible This series add supports for writing to system tables through Alternator, and examples of tests using this capability (and utility functions to make it easy). Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write" No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low. Fixes #12348 Closes scylladb/scylladb#19147 * github.com:scylladb/scylladb: test/alternator: utility functions for changing configuration alternator: add optional support for writing to system table test/alternator: reduce duplicated code	2025-08-08 09:13:15 +03:00
Benny Halevy	3f44dba014	sstables: make_entry_descriptor: make regex non-greedy With greedy matching, an sstable path in a snapshot directory with a tag that resembles a name-<uuid> would match the dir regular expression as the longest match, while a non-greedy regular expression would correctly match the real keyspace and table as the shortest match. Also, add a regression unit test reproducing the issue and validating the fix. Fixes #25242 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25323	2025-08-07 15:35:11 +03:00
Avi Kivity	8164f72f6e	Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Refs #22733 * No backport required Closes scylladb/scylladb#25222 * github.com:scylladb/scylladb: locator: abstract_replication_strategy: implement local_replication_strategy locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently locator: abstract_replication_map: rename make_effective_replication_map locator: abstract_replication_map: rename calculate_effective_replication_map replica: database: keyspace: rename {create,update}_effective_replication_map locator: effective_replication_map_factory: rename create_effective_replication_map locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al locator: abstract_replication_strategy: rename global_vnode_effective_replication_map keyspace: rename get_vnode_effective_replication_map dht: range_streamer: use naked e_r_m pointers storage_service: use naked e_r_m pointers alternator: ttl: use naked e_r_m pointers locator: abstract_replication_strategy: define is_local	2025-08-07 12:51:43 +03:00
Nadav Har'El	6f415b2f10	Merge 'test/cqlpy: Adjust test_describe.py to work against Cassandra' from Dawid Mędrek We adjust most of the tests in `cqlpy/test_describe.py` so that they work against both Scylla and Cassandra. This PR doesn't cover all of them, just those I authored. Refs scylladb/scylladb#11690 Backport: not needed. This is effectively a code cleanup. Closes scylladb/scylladb#25060 * github.com:scylladb/scylladb: test/cqlpy/test_describe.py: Adjust test_create_role_with_hashed_password_authorization to work with Cassandra test/cqlpy/test_describe.py: Adjust test_desc_restore to work with Cassandra test/cqlpy/test_describe.py: Mark Scylla-only tests as such	2025-08-07 12:43:04 +03:00
Avi Kivity	90eb6e6241	Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski This is the next part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25154 Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here. The new code added here is not used for anything yet, but it's posted as a separate PR to keep things reviewably small. This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes. It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR) into the on-disk format, and the logic for traversing the on-disk nodes during a read. New functionality, no backporting needed. Closes scylladb/scylladb#25317 * github.com:scylladb/scylladb: sstables/trie: add tests for BTI node serialization and traversal sstables/trie: implement BTI node traversal sstables/trie: implement BTI serialization utils/cached_file: add get_shared_page() utils/cached_file: replace a std::pair with a named struct	2025-08-07 12:15:42 +03:00
Nadav Har'El	d632599a92	Merge 'test.py: native pytest repeats' from Andrei Chekun Previous way of execution repeat was to launch pytest for each repeat. That was resource consuming, since each time pytest was doing discovery of the tests. Now all repeats are done inside one pytest process. Backport for 2025.3 is needed, since this functionality is framework only, and 2025.3 affected with this slow repeats as well. Closes scylladb/scylladb#25073 * github.com:scylladb/scylladb: test.py: add repeats in pytest test.py: add directories and filename to the log files test.py: rename log sink file for boost tests test.py: better error handling in boost facade	2025-08-06 18:18:03 +03:00
Benny Halevy	5e5e63af10	scylla-sstable: print_query_results_json: continue loop if row is disengaged Otherwise it is accessed right when exiting the if block. Add a unit test reproducing the issue and validating the fix. Fixes #25325 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25326	2025-08-06 16:44:51 +03:00
Szymon Malewski	eb11485969	test/alternator: enable more relevant logs in CI. This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%. This affects running alternator tests only with `test.py`, not with `test/alternator/run`. Closes #24645 Closes scylladb/scylladb#25327	2025-08-06 16:37:25 +03:00
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could, since the plan is to get rid of it once all system keyspaces areconverted to local or tablets replication (and propagated everywhere if needed using raft group0) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	babb4a41a8	locator: abstract_replication_map: rename calculate_effective_replication_map to calculate_vnode_effective_replication_map since it is specific to vnode-based range calculations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Pavel Emelyanov	0616407be5	Merge 'rest_api: add endpoint which drops all quarantined sstables' from Taras Veretilnyk Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061 Backport is not required, it is new functionality Closes scylladb/scylladb#25063 * github.com:scylladb/scylladb: docs: Add documentation for the nodetool dropquarantinedsstables command nodetool: add command for dropping quarantine sstables rest_api: add endpoint which drops all quarantined sstables	2025-08-06 11:55:15 +03:00
Nadav Har'El	10588958e0	test/alternator: add regression test for keep-alive support An Alternator user complained about suspiciously many new connections being opened, which raised a suspicion that maybe Alternator doesn't support HTTP and HTTPS keep-alive (allowing a client to reuse the same connection for multiple requests). It turns out that we never had a regression test that this feature actually works (and doesn't break), so this patch adds one. The test confirms that Alternator's connection reuse (keep-alive) feature actually works correctly. Of course, only if the driver really tries to reuse a connection - which is a separate question and needs testing on the driver side (scylladb/alternator-load-balancing#82). The test sends two requests using Python's "requests" library which can normally reuse connections (it uses a "connection pool"), and checks if the connection was really reused. Unfortunately "requests" doesn't give us direct knowledge of whether or not it reused a connection, so we check this using simple monkey-patching. I actually tried multiple other approaches before settling on this one. The approach needs to work on both HTTP and HTTPS, and also on AWS DynamoDB. Importantly, the test checks both keep-alive and non-keep-alive cases. This is very important for validating the test itself and its tricky monkey-patching code: The test is meant to detect when the socket is not reused for the second request, so we want to also check the non-keep- alive case where we know the socket isn't reused, to see the test code really detected this situation. By default, this test runs (like all of Alternator's test suite) on HTTP sockets. Running this test with "test/alternator/run --https" will run it on HTTPS sockets. The test currently passes on both HTTP and HTTPS. It also passes on AWS DynamoDB ("test/alternator/run --aws") Fixes #23067 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25202	2025-08-06 11:41:21 +03:00
Karol Nowacki	032e8f9030	test/boost/vector_store_client_test.cc: Fix flaky tests The vector_store_client_test was observed to be flaky, sometimes hanging while waiting for a response from HTTP server. Problem: The default load balancing algorithm (in Seastar's posix_server_socket_impl::accept) could route an incoming connection to a different shard than the one executing the test. Because the HTTP server is a non-sharded service running only on the test's originating shard, any connection submitted to another shard would never be handled, causing the test client to hang waiting for response. Solution: The patch resolves the issue by explicitly setting fixed cpu load balancing algorithm. This ensures that incoming connections are always handled on the same shard where the HTTP server is running. Closes scylladb/scylladb#25314	2025-08-06 11:24:51 +03:00
Nadav Har'El	fa86405b1f	test/alternator: utility functions for changing configuration Now that the previous patch made it possible to write to system tables in Alternator tests, this patch introduces utility functions for changing the configuration - scylla_config_write() in addition to the scylla_config_read() we already had, and scylla_config_temporary() to temporarily change a configurable parameter and then restore it to its old value. This patch adds a silly test that temporarily modifies the query_tombstone_page_limit configuration parameter. Later we can add more tests that use the new test functions for more "serious" testing of real features. In particular, we don't have an Alternator test for the max_concurrent_requests_per_shard configuration - and I want to write one. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 10:02:24 +03:00
Nadav Har'El	a896e2dbb9	alternator: add optional support for writing to system table In commit `44a1daf` we added the ability to read system tables through the DynamoDB API (actually, the Scan and Query requests only). This ability is useful for tests, and can also be useful to users who want to read information that is only available through system tables. This patch adds support also for writing into system tables. This will be useful for Alternator tests, were we want to temporarily change some live-updatable configuration option - and so far haven't been able to do that like we did do in some cql-pytest tests. For reasons explained in issue #23218, only superuser roles are allowed to write to system tables - it is not enough for the role to be granted MODIFY permissions on the system table or on ALL KEYSPACES. Moreover, the ability to modify system tables carries special risks, so this patch only allows writes to the system tables if a new configuration option "alternator_allow_system_table_write" turned on. This option is turned off by default. This patch also includes a test for this new configuration-writing capability. The test scripts test/alternator/run and test.py now run Scylla with alternator_allow_system_table_write turned on, but the new test can also run without this option, and will be skipped in that case (to allow running the test suite against some manually- run instance of Scylla). Fixes: #12348 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 10:00:04 +03:00
Nadav Har'El	5913498fff	test/alternator: reduce duplicated code Four tests had almost identical code to read an item from Scylla configuration (using the system.config system table). It's time to make this into a new utility function, scylla_config_read(). This is a good time to do it, because in a later patch I want to also add a similar function to write into the configuration. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 09:56:47 +03:00
Nadav Har'El	d46dda0840	Merge 'cql, vector_search: implement read path' from null This pull request is an addition of ANN OF queries. The patch contains: - CQL syntax for ORDER BY `vector_column_name` ANN OF `vector_literal` clause of SELECT statements. - implementation of external ANN queries (using vector-store service) - tests Example syntax: ``` SELECT comment FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 3; ``` Limit can be between 1 and 1000 - same as for Cassandra. Co-authored-by: @janpiotrlakomy @smoczy123 Fixes: VECTOR-48 Fixes: VECTOR-46 Closes scylladb/scylladb#24444 * github.com:scylladb/scylladb: cql3/statements: implement external `ANN OF` queries vector_store_client: implement ann_error_visitor test/cqlpy: check ANN queries disallow filtering properly cassandra_tests: translate vector_invalid_query_test cassandra_tests: copy vector_invalid_query_test from Cassandra vector_index: make parameter names case insensitive cql3/statements: add `ANN OF` queries support to select statements cql/Cql.g: extend the grammar to allow for `ANN OF` queries cql3/raw: add ANN ordering to the raw statement layer	2025-08-06 09:53:38 +03:00
Avi Kivity	bb922b2aa9	Merge 'truncate: change check for write during truncate into a log warning' from Ferenc Szili TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail. This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that: - all data written before TRUNCATE starts is deleted - none of the data after TRUNCATE completes is deleted Fixes: #25173 Fixes: #25013 Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1 Closes scylladb/scylladb#25174 * github.com:scylladb/scylladb: truncate: add test for truncate with concurrent writes truncate: change check for write during truncate into a log warning	2025-08-06 00:03:37 +03:00
Michał Chojnowski	9930cd59eb	sstables/trie: add tests for BTI node serialization and traversal Adds tests which check that nodes serialized by `bti_node_sink` are readable by `bti_node_reader` with the right result. (Note: there are no tests which check compatibility of the encoded nodes with Cassandra or with handwritten hexdumps. There are only tests for mutual compatibility between Scylla's writers and readers. This can be considered a gap in testing.)	2025-08-05 21:48:24 +02:00
Pavel Emelyanov	10056a8c6d	Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm. To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does. Fixes: https://github.com/scylladb/scylladb/issues/25044 Should be backported to 2025.3 since we need this fix for the restore Closes scylladb/scylladb#24961 * github.com:scylladb/scylladb: s3_creds: code cleanup s3_creds: Make `reload` unconditional s3_creds: Add test exposing credentials renewal issue	2025-08-05 17:49:13 +03:00
Michael Litvak	faebfdf006	test/cluster/test_tablets_colocation: fix flaky test When restarting the server in the test, wait for it to become ready before requesting tablet repair. Fixes scylladb/scylladb#25261 Closes scylladb/scylladb#25263	2025-08-05 15:36:03 +02:00
Avi Kivity	4c785b31c7	Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series adds also Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using -without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field. Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator connections, we will list the currently active requests. The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients. Fixes #24993 This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug on it can cause problems for every Alternator request. Closes scylladb/scylladb#25178 * github.com:scylladb/scylladb: test/cqlpy: slightly strengthen test for system.clients generic_server: use utils::scoped_item_list docs/alternator: document the system.clients system table in Alternator alternator: add test for Alternator clients in system.clients alternator: list active Alternator requests in system.clients utils: unit test for utils::scoped_item_list utils: add a scoped_item_list utility class utils: add "fatal" version of utils::on_internal_error()	2025-08-05 15:55:41 +03:00
Ferenc Szili	33488ba943	truncate: add test for truncate with concurrent writes test_validate_truncate_with_concurrent_writes checks if truncate deletes all the data written before the truncate starts, and does not delete any data after truncate completes.	2025-08-05 13:54:14 +02:00
Dawid Pawlik	74f603fe99	test/cqlpy: check ANN queries disallow filtering properly Add tests checking if filtering with clustering column or using index is disallowed while performing ANN query.	2025-08-05 12:34:48 +02:00
Artsiom Mishuta	4b975668f6	tiering (test.py): introduce tiering labels introduce tiering marks 1 “unstable” - For unstable tests that will be will continue runing every night and generate up-to-date statistics with failures without failing the “Main” verification path(scylla-ci, Next) 2 “nightly” - for tests that are quite old, stable, and test functionality that rather not be changed or affected by other features, are partially covered in other tests, verify non-critical functionality, have not found any issues or regressions, too long to run on every PR, and can be popped out from the CI run. set 7 long tests(according to statistic in elastic) as nightly(theses 8 tests took 20% of CI run, about 4 hours without paralelization) 1 test as unstable(as exaple ot marker usage) Closes scylladb/scylladb#24974	2025-08-04 15:38:16 +03:00
Piotr Dulikowski	ec7832cc84	Merge 'Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Patryk Jędrzejczak The following steps are performed in sequence as part of the Raft-based recovery procedure: - set `recovery_leader` to the host ID of the recovery leader in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - perform a rolling restart (with the recovery leader being restarted first). These steps are not intuitive and more complicated than they could be. In this PR, we simplify these steps. From now on, we will be able to simply set `recovery_leader` on each node just before restarting it. Apart from making necessary changes in the code, we also update all tests of the Raft-based recovery procedure and the user-facing documentation. Fixes scylladb/scylladb#25015 The Raft-based procedure was added in 2025.2. This PR makes the procedure simpler and less error-prone, so it should be backported to 2025.2 and 2025.3. Closes scylladb/scylladb#25032 * github.com:scylladb/scylladb: docs: document the option to set recovery_leader later test: delay setting recovery_leader in the recovery procedure tests gossip: add recovery_leader to gossip_digest_syn db: system_keyspace: peers_table_read_fixup: remove rows with null host_id db/config, gms/gossiper: change recovery_leader to UUID db/config, utils: allow using UUID as a config option	2025-08-04 08:29:32 +02:00
Ernest Zaslavsky	e4ebe6a309	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading.	2025-08-03 17:41:35 +03:00
Ernest Zaslavsky	68855c90ca	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred.	2025-08-03 17:41:25 +03:00
Avi Kivity	1c25aa891b	Merge 'storage_proxy.cc: get_cas_shard: fallback to the primary replica shard' from Petr Gusev Currently, `get_cas_shard` uses `sharder.shard_for_reads` to decide which shard to use for LWT execution—both on replicas and the coordinator. If the coordinator is not a replica, `shard_for_reads` returns a default shard (shard 0). There are at least two problems with this: * shard 0 can become overloaded, because all LWT coordinators-but-not-replacas are served on it. * mismatch with replicas: the default shard doesn't match what `shard_for_reads` returns on replicas. This hinders the "same shard for client and server" RPC level optimization. In this PR we change `get_cas_shard` to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (`paxos::paxos_state::get_cas_lock`). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional `smp::submit_to` will be needed on server side. backport: not needed, since this fix applies only to LWT over tablets, and this feature is not released yet Closes scylladb/scylladb#25224 * github.com:scylladb/scylladb: test_tablets_lwt.py: make tests rf_rack_valid test_tablets_lwt: add test_lwt_coordinator_shard storage_proxy.cc: get_cas_shard: fallback to the primary replica shard sharder: add try_get_shard_for_reads method	2025-08-01 23:07:25 +03:00
Avi Kivity	8b1bf46086	Merge 'sstables: introduce trie_writer' from Michał Chojnowski This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization). Refs scylladb/scylladb#19191 New functionality, no backporting needed. Closes scylladb/scylladb#25154 * github.com:scylladb/scylladb: sstables: introduce trie_writer utils/bit_cast: add object_representation()	2025-08-01 20:23:24 +03:00
Andrei Chekun	c0d652a973	test.py: change boost test stdout to use filehandler instead of pipe With current implementation if pytest will be killed, it will not be able to write the stdout from the boost test. With a new way it should be updated while test executing, instead of writing it the end of the test. Closes scylladb/scylladb#25260	2025-08-01 15:05:00 +03:00
Jan Łakomy	8b2ed0f014	cassandra_tests: translate vector_invalid_query_test Translate vector_invalid_query_test which tests parsing of ANN OF syntax. Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>	2025-08-01 12:08:50 +02:00
Jan Łakomy	eec47d9059	cassandra_tests: copy vector_invalid_query_test from Cassandra Copy over and comment out this tests code from Cassandra for it to be translated later.	2025-08-01 12:08:50 +02:00
Nikos Dragazis	2656fca504	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995	2025-08-01 12:11:27 +03:00
Nadav Har'El	2431f92967	alternator, test: add reproducer for issue about immediate LWT timeout This patch adds a reproducer for issue #16261, where it was reported that when Alternator read-modify-write (using LWT) operations to the same partition are sent to different nodes, sometimes the operation fails immediately, with an InternalServerError claiming to be a "timeout", although this happens almost immediately (after a few milliseconds), not after any real timeout. The test uses 3 nodes, and 3 threads which send RMW operations to different items in the same partition, and usually (though not with 100% certainty) it reaches the InternalServerError in around 100 writes by each thread. This InternalServerError looks like: Internal server error: exceptions::mutation_write_timeout_exception (Operation timed out for alternator_alternator_Test_1719157066704.alternator_Test_1719157066704 - received only 1 responses from 2 CL=LOCAL_SERIAL.) The test also prints how much time it took for the request to fail, for example: In incrementing 1,0 on node 1: error after 0.017074108123779297 This is 0.017 seconds - it's not the cas_contention_timeout_in_ms timeout (1 second) or any other timeout. If we enable trace logging, adding to topology_experimental_raft/suite.yaml extra_scylla_cmdline_options: ["--logger-log-level", "paxos=trace"] we get the following TRACE-level message in the log: paxos - CAS[0] accept_proposal: proposal is partially rejected This again shows the problem is "uncertainty" (partial rejection) and not a timeout. Refs #16261 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#19445	2025-08-01 11:58:52 +03:00
Nadav Har'El	edc15a3cf5	test/cqlpy: slightly strengthen test for system.clients We already have a rather rudimentary test for system.clients listing CQL connections. However, as written the test will pass if system.clients is empty :-) So let's strengthen the test to verify that there must be at least one CQL connection listed in system.clients. Indeed, the test runs the "SELECT FROM system.clients" over one CQL connection, so surely that connection must be present. This patch doesn't strengthen this test in any other way - it still has just one connection, not many, it still doesn't validate the values of most of the columns, and it is still written to assume the Scylla server is running on localhost and not running any other workload in parallel. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:32:19 +03:00
Nadav Har'El	5baa4c40fd	alternator: add test for Alternator clients in system.clients This patch adds a regression test for the feature added in the previous patch, i.e that the system.clients virtual table also lists ongoing Alternator request. The new test reads the system.clients system table using an Alternator Scan request, so it should see its own request - at least - in the result. It verifies that it sees Alternator requests (at least one), and that these requests have the expected fields set, and for a couple of fields, we even know which value to expect (the "client_type" field is "alternator", and the "ssl_enabled" field depends on whether the test is checking an http:// or https:// URL (you can try both in test/alternator/run - by using or not using the "--https" parameter). The new test fails before the previous patch (because system.clients will not list any Alternator connection), and passes after it. As all tests in test_system_tables.py for Scylla-specific system tables, this test is marked scylla_only and skipped when running on AWS DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	20b31987e1	utils: unit test for utils::scoped_item_list The previous test introduced a new utility class, utils::scoped_item_list. This patch adds a comprehensive unit test for the new class. We test basic usage of scoped_item_list, its size() and empty() methods, how items are removed from the list when their handle goes out of scope, how a handle's move constructor works, how items can be read and written through their handles, and finally that removing an item during a for_each_gently() iteration doesn't break the iteration. One thing I still didn't figure out how to properly test is how removing an item during multiple iterations that run concurrently fixes multiple iterators. I believe the code is correct there (we just have a list of ongoing iterations - instead of just one), but haven't found yet a way to reproduce this situation in a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Taras Veretilnyk	1d6808aec4	topology_coordinator: Make tablet_load_stats_refresh_interval configurable This commits introduces an config option 'tablet_load_stats_refresh_interval_in_seconds' that allows overriding the default value without using error injection. Fixes scylladb/scylladb#24641 Closes scylladb/scylladb#24746	2025-07-31 14:31:55 +03:00
Michał Chojnowski	c8682af418	sstables: introduce trie_writer This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).	2025-07-31 12:51:37 +02:00
Calle Wilund	43f7eecf9e	compress: move compress.cc/hh to sstables/compressor Fixes #22106 Moves the shared compress components to sstables, and rename to match class type. Adjust includes, removing redundant/unneeded ones where possible. Closes scylladb/scylladb#25103	2025-07-31 13:10:41 +03:00
Pavel Emelyanov	34608450c5	Merge 'qos: don't populate effective service level cache until auth is migrated to raft' from Piotr Dulikowski Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). Closes scylladb/scylladb#25188 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-07-31 13:05:27 +03:00
Petr Gusev	3500a10197	scylla_cluster.py: add try_get_host_id Tests sometimes fail in ScyllaCluster.add_server on the 'replaced_srv.host_id' line because host_id is not resolved yet. In this commit we introduce functions try_get_host_id and get_host_id that resolve it when needed. Closes scylladb/scylladb#25177	2025-07-31 10:37:06 +02:00
Patryk Jędrzejczak	c41f0e6da9	Merge 'generic server: 2 step shutdown' from Sergey Zolotukhin This PR implements solution proposed in scylladb/scylladb#24481 Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections. The updated shutdown process is as follows: 1. Initial Shutdown Phase * Close the accept gate to block new incoming connections. * Abort all accept() calls. * For all active connections: * Close only the input side of the connection to prevent new requests. * Keep the output side open to allow responses to be sent. 2. Drain Phase * Wait for all in-progress requests to either complete or fail. 3. Final Shutdown Phase * Fully close all connections. Fixes scylladb/scylladb#24481 Closes scylladb/scylladb#24499 * https://github.com/scylladb/scylladb: test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. generic_server: Two-step connection shutdown. transport: consmetic change, remove extra blanks. transport: Handle sleep aborted exception in sleep_until_timeout_passes generic_server: replace empty destructor with `= default` generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. test: Add test for query execution during CQL server shutdown	2025-07-31 10:32:30 +02:00
Nadav Har'El	78c10af960	test/cqlpy: add reproducer for INSERT JSON .. IF NOT EXISTS bug This patch adds an xfailing test reproducing a bug where when adding an IF NOT EXISTS to a INSERT JSON statement, the IF NOT EXISTS is ignored. This bug has been known for 4 years (issue #8682) and even has a FIXME referring to it in cql3/statements/update_statement.cc, but until now we didn't have a reproducing test. The tests in this patch also show that this bug is specific to INSERT JSON - regular INSERT works correctly - and also that Cassandra works correctly (and passes the test). Refs #8682 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25244	2025-07-30 20:14:50 +03:00

1 2 3 4 5 ...

9280 Commits