scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 20:46:56 +00:00

Author	SHA1	Message	Date
Dawid Mędrek	fca03ca915	test/cqlpy/test_describe.py: Mark Scylla-only tests as such Tests verifying that auth and service levels are part of the output of `DESCRIBE SCHEMA` were not marked as `scylla_only` when they were written, but they're a feature only Scylla has. Because of that, let's mark them with `scylla_only` so they're not run against Cassandra to avoid unnecessary failures. We also provide a short explanation for each test why it's marked that way.	2025-07-17 21:45:44 +02:00
Petr Gusev	2027856847	Revert "paxos_state: read repair for intranode_migration" This reverts commit `45f5efb9ba`. The load_and_repair_paxos_state function was introduced in scylladb/scylladb#24478, but it has never been tested or proven useful. One set of problems stems from its use of local data structures from a remote shard. In particular, system_keyspace and schema_ptr cannot be directly accessed from another shard; doing so is a bug. More importantly, load_paxos_state on different shards can't ever return different values. The actual shard from which data is read is determined by sharder.shard_for_reads, and storage_proxy will jump back to the appropriate shard if the current one doesn't match. This means load_and_repair_paxos_state can't observe paxos state from a write-but-not-read shard, and therefore will never be able to repair anything. We believe this explicit Paxos state read-repair is not needed at all. Any paxos state read which drives some paxos round forward is already accompanied by a paxos state write. Suppose we wrote the state to the old shard but not to the new shard (because of some error) while streaming is already finished. The RPC call (prepare or accept) will return an error to the coordinator, so such a replica response won't affect the current round. This write won't affect any subsequent paxos rounds either, unless in those rounds the write actually succeeds on both shards, effectively 'auto-repairing' paxos state. The same applies if we managed to write to the new shard but not to the old shard. Any subsequent reads will observe either the old state or the new state (if the tablet already switched reads to the new shard). In any case, we'll have to write the state to all relevant shards from sharder.shard_for_writes (one or two) before sending the RPC response, making this state visible for all subsequent reads. 
Thus, the monotonicity property ("once observed, the state must always be observed") appears to hold without requiring explicit read-repair and load_and_repair_paxos_state is not needed. Closes scylladb/scylladb#24926	2025-07-17 14:00:43 +02:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Nadav Har'El	04b263b51a	Merge 'vector_index: do not create a view when creating a vector index' from Michał Hudobski This PR adds a way for custom indexes to decide whether a view should be created for them, as for the vector_index the view is not needed, because we store it in the external service. To allow this, custom logic for describing indexes using custom classes was added (as it used to depend on the view corresponding to an index). Fixes: VECTOR-10 Closes scylladb/scylladb#24438 * github.com:scylladb/scylladb: custom_index: do not create view when creating a custom index custom_index: refactor describe for custom indexes custom_index: remove unneeded duplicate of a static string	2025-07-17 13:48:49 +03:00
Michał Chojnowski	4e4a4b6622	sstables: add sstable::make_index_reader() and use where appropriate If we add multiple index implementations, users of index readers won't easily know which concrete index reader type is the right one to construct. We also don't want pieces of code to depend on functionality specific to certain concrete types, if that's not necessary. So instead of constructing the readers by themselves, they can use a helper function, which will return an abstract (virtual) index reader. This patch adds such a function, as a method of `sstable`.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	1c4065e7dd	sstables/mx: in readers, use abstract_index_reader instead of index_reader This makes clear which methods of index_reader are available for use by sstable readers, and which aren't.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	efcf3f5d66	sstables: in validate(), use abstract_index_reader instead of index_reader where possible After we add a second index implementation, we will probably want to adjust validate() to work with either implementation. Some validations will be format-specific, but some will be common. For now, let's use abstract_index_reader for the validations which can be done through that interface, and let's have downcast-specific codepaths for the others. Note: we change a `get_data_file_position()` call to `data_file_positions().start`. The call happens at the beginning of a partition, and at this point these two expressions are supposed to be equivalent.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	92219a5ef8	test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader We don't want tests to create the concrete `index_reader` directly. We would like them to be able to test both sstables which use `index_reader`, and those which will use the planned new index implementation. So we will let the tests construct an abstract_index_reader and pass it to the index_reader_assertions, which will be able to assert the requested properties on various implementations as it wants.	2025-07-17 10:32:56 +02:00
Michał Chojnowski	c052ccd081	sstables/index_reader: introduce abstract_index_reader We want to implement BTI indexes in Scylla. After we do that, some sstables will use a BTI index reader, while others will use the old BIG index reader. To handle that, we can expose a common virtual "index reader" interface to sstable readers. This is what this patch does. This interface can't be quite fully implemented by a BTI index, because some methods return keys which a BIG index stores, but a BTI index doesn't. So it will be further restricted in future patches. But for now, we only extract all methods currently used by the readers to a virtual interface.	2025-07-17 10:32:56 +02:00
Botond Dénes	fd6877c654	Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches. Alternator's Scan and Query operations return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1 MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e., a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. 
When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535 The stalls caused by large allocations were seen by actual users, so it makes sense to backport this patch. On the other hand, the patch, while not big, is fairly intrusive (it modifies the normal Scan and Query path and the later patches also do some cleanup of additional code), so there is some small risk involved in the backport. Closes scylladb/scylladb#24480 * github.com:scylladb/scylladb: alternator: clean up by co-routinizing alternator: avoid spamming the log when failing to write response alternator: clean up and simplify request_return_type alternator: avoid oversized allocation in Query/Scan	2025-07-17 11:30:40 +03:00
Calle Wilund	5dd871861b	tests::proc::process_fixture: Fix line handler adaptor buffering Fixes #24998 Helper routine translating input_stream buffers to single lines did not loop over current buffer state, leading to only the first line being sent to end listener. Rewrote to use range iteration instead. Nicer. Closes scylladb/scylladb#24999	2025-07-17 10:58:03 +03:00
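The buffering bug described in the entry above (only the first line of each buffer reaching the listener) can be illustrated with a minimal Python sketch. This is not the Scylla C++ fixture code; `LineAdaptor`, `feed`, and `close` are hypothetical names chosen for illustration. The point is only that every complete line in the accumulated buffer is delivered, and a trailing partial line is kept for the next chunk.

```python
from typing import Callable, List


class LineAdaptor:
    """Hypothetical adaptor: turns arbitrary byte chunks into whole lines."""

    def __init__(self, on_line: Callable[[str], None]) -> None:
        self._on_line = on_line
        self._pending = b""  # bytes received after the last newline

    def feed(self, chunk: bytes) -> None:
        # Split on every newline in the accumulated data, not just the first;
        # the last element is the (possibly empty) unterminated remainder.
        *lines, self._pending = (self._pending + chunk).split(b"\n")
        for line in lines:
            self._on_line(line.decode())

    def close(self) -> None:
        # Deliver a final line that was never newline-terminated.
        if self._pending:
            self._on_line(self._pending.decode())
            self._pending = b""


received: List[str] = []
adaptor = LineAdaptor(received.append)
adaptor.feed(b"one\ntwo\nth")  # two complete lines plus a partial one
adaptor.feed(b"ree\n")         # completes the partial line
adaptor.close()
```

After these calls, `received` holds all three lines in order; an adaptor that handled only the first newline per chunk would have dropped `two`.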
Ernest Zaslavsky	342e94261f	s3_client: parse multipart response XML defensively Ensure robust handling of XML responses when initiating multipart uploads. Check for the existence of required nodes before access, and throw an exception if the XML is empty or malformed. Refs: https://github.com/scylladb/scylladb/issues/24676 Closes scylladb/scylladb#24990	2025-07-17 10:55:04 +03:00
Botond Dénes	054ea54565	Merge 'streaming: Avoid deadlock by running view checks in a separate scheduling group' from Tomasz Grabiec This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 Closes scylladb/scylladb#24929 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-17 10:24:41 +03:00
Botond Dénes	4c832d583e	Merge 'repair: Speed up ranges calculation when small table optimization is on' from Asias He repair: Speed up ranges calculation when small table optimization is on Normally, during bootstrap, in repair_service::bootstrap_with_repair, we need to calculate which range to sync data from carefully for the new node. With small table optimization on, we pass a single full range and all peer nodes to row level repair to sync data with. Now that we only need to pass a single range and full peers, there is no need to calculate the ranges and peers in repair_service::bootstrap_with_repair and drop it later. The calculation takes time which slows down bootstrap, e.g., ``` Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - bootstrap_with_repair: started with keyspace=system_distributed_everywhere, nr_ranges=23809 Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]: sync data for keyspace=system_distributed_everywhere, status=started, reason=bootstrap, small_table_optimization=true ``` The range calculation took 15 seconds for system_distributed_everywhere table. To fix, the ranges calculation is skipped if small table optimization is on for the keyspace. Before: cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s After: cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s A 15% improvement to bootstrap 30 node cluster was observed. Fixes #24817 Closes scylladb/scylladb#24901 * github.com:scylladb/scylladb: repair: Speed up ranges calculation when small table optimization is on test: Add test_boot_nodes.py	2025-07-17 10:23:45 +03:00
Patryk Jędrzejczak	a654101c40	Merge 'test.py: add missed parameters that should be passed from test.py to pytest' from Andrei Chekun Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling of these parameters: `--random-seed` and `--x-log2-compaction-groups`. Since 2025.3 is affected by this issue and this is only a framework change, a backport to that version is needed. Fixes: https://github.com/scylladb/scylladb/issues/24927 Closes scylladb/scylladb#24928 * https://github.com/scylladb/scylladb: test.py: add bypassing x_log2_compaction_groups to boost tests test.py: add bypassing random seed to boost tests	2025-07-16 15:29:17 +02:00
Avi Kivity	c762425ea7	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. Additionally, the existing `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start	2025-07-16 13:15:54 +03:00
Asias He	6c49b7d0ce	repair: Speed up ranges calculation when small table optimization is on Normally, during bootstrap, in repair_service::bootstrap_with_repair, we need to calculate which range to sync data from carefully for the new node. With small table optimization on, we pass a single full range and all peer nodes to row level repair to sync data with. Now that we only need to pass a single range and full peers, there is no need to calculate the ranges and peers in repair_service::bootstrap_with_repair and drop it later. The calculation takes time which slows down bootstrap, e.g., ``` Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - bootstrap_with_repair: started with keyspace=system_distributed_everywhere, nr_ranges=23809 Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]: sync data for keyspace=system_distributed_everywhere, status=started, reason=bootstrap, small_table_optimization=true ``` The range calculation took 15 seconds for system_distributed_everywhere table. To fix, the ranges calculation is skipped if small table optimization is on for the keyspace. Before: cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s After: cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s A 15% improvement to bootstrap 30 node cluster was observed. Fixes #24817	2025-07-16 15:33:15 +08:00
Piotr Dulikowski	a14b7f71fe	auth: fix crash when migration code runs parallel with raft upgrade The functions password_authenticator::start and standard_role_manager::start have a similar structure: they spawn a fiber which invokes a callback that performs some migration until that migration succeeds. Both handlers set a shared promise called _superuser_created_promise (those are actually two promises, one for the password authenticator and the other for the role manager). The handlers are similar in both cases. They check if auth is in legacy mode, and behave differently depending on that. If in legacy mode, the promise is set (if it was not set before), and some legacy migration actions follow. In auth-on-raft mode, an attempt is made to create the superuser, and if it succeeds then the promise is _unconditionally_ set. While it makes sense at a glance to set the promise unconditionally, there is a non-obvious corner case during upgrade to topology on raft. During the upgrade, auth switches from the legacy mode to auth-on-raft mode. Thus, if the callback didn't succeed in legacy mode and then tries to run in auth-on-raft mode and succeeds, it will unconditionally set a promise that was already set - this is a bug and triggers an assertion in seastar. Fix the issue by surrounding the `shared_promise::set_value` call with an `if` - like it is already done for the legacy case. Fixes: scylladb/scylladb#24975 Closes scylladb/scylladb#24976	2025-07-16 10:22:48 +03:00
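The guarded-set pattern from the entry above can be sketched in Python with `asyncio.Future` standing in for seastar's `shared_promise`. This is an analogy only, not the Scylla code; `SuperuserReady` and `mark_created` are hypothetical names. Setting a result on an already-completed `asyncio.Future` raises `asyncio.InvalidStateError`, which plays the role of the seastar assertion here.

```python
import asyncio


class SuperuserReady:
    """Hypothetical stand-in for the `_superuser_created_promise` holder."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._created = loop.create_future()

    def mark_created(self) -> bool:
        # Guard: only the first successful caller completes the future.
        if self._created.done():
            return False
        self._created.set_result(None)
        return True

    def mark_created_unguarded(self) -> None:
        # The buggy shape: sets the result regardless of current state.
        self._created.set_result(None)


loop = asyncio.new_event_loop()
ready = SuperuserReady(loop)
first = ready.mark_created()   # e.g. the legacy-mode path completes it
second = ready.mark_created()  # e.g. the auth-on-raft path runs afterwards

try:
    ready.mark_created_unguarded()
    unguarded_raised = False
except asyncio.InvalidStateError:
    unguarded_raised = True
finally:
    loop.close()
```

The guarded call is idempotent, so it does not matter which of the two modes completes the future first.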
Michał Chojnowski	1e7a292ef4	sstables/index_reader: extract a prefetch_lower_bound() method The sstable reader reaches directly for a `clustered_index_cursor`. But a BTI index reader won't be able to implement `clustered_index_cursor`, because a BTI index doesn't store full clustering keys, only some trie-encoded prefixes. So we want to weaken the dependency. Instead of reaching for `clustered_index_cursor`, we add a method which expresses our intent, and we let `index_reader` touch the cursor internally.	2025-07-16 00:13:20 +02:00
Andrzej Jackowski	77a9b5919b	main: utils: add thread names to alien workers This commit adds a call to `pthread_setname_np` in `alien_worker::spawn`, so each alien worker thread receives a descriptive name. This makes debugging, monitoring, and performance analysis easier by allowing alien workers to be clearly identified in tools such as `perf`.	2025-07-15 23:29:21 +02:00
Andrzej Jackowski	9574513ec1	auth: move passwords::check call to alien thread Analysis of customer stalls showed that the `detail::hash_with_salt` function, called from `passwords::check`, often blocks the reactor. This function internally uses the `crypt_r` function from an external library to compute password hashes, which is a CPU-intensive operation. To prevent such reactor stalls, this commit moves the `passwords::check` call to a dedicated alien thread. This thread is created at system startup and is shared by all shards. Within the alien thread, an `std::mutex` synchronizes access between the thread and the shards. While this could theoretically cause frequent lock contentions, in practice, even during connection storms, the number of new connections per second per shard is limited (typically hundreds per second). Additionally, the `_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not too many connections are processed at once. Fixes scylladb/scylladb#24524	2025-07-15 23:29:13 +02:00
Andrzej Jackowski	4ac726a3ff	test: wait for 3 clients with given username in test_service_level_api test_service_level_api tests create a new session and wait for all clients to authenticate. However, the check that all connections are authenticated is done by verifying that there are no connections with the username 'anonymous', which is insufficient if new connections have not yet been listed. To avoid test failures, this commit introduces an additional check that verifies all expected clients are present in the system.clients table before proceeding with the test.	2025-07-15 23:28:39 +02:00
Andrzej Jackowski	8d398fa076	auth: refactor password checking in password_authenticator This commit splits an if statement to two ifs, to make it possible to call `password::check` function from another (alien) thread in the next commit of this patch series. Ref. scylladb/scylladb#24524	2025-07-15 23:28:39 +02:00
Andrzej Jackowski	b3c6af3923	auth: make SHA-512 the only password hashing scheme for new passwords Before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt algorithms do not work because their scheme prefix lacks the required round count (e.g., it is `$2y$` instead of `$2y$05$`). We suspect this never worked as intended. Moreover, bcrypt tends to be slower than SHA-512, so we do not want to fix the prefix and start using it. - SHA-256 and SHA-512 are both part of the SHA-2 family, and libraries that support one almost always support the other. It is not expected to find a library that supports only SHA-256 but not SHA-512. - MD5 is not considered secure for password hashing. Therefore, this commit removes support for bcrypt_y, bcrypt_a, SHA-256, and MD5 for hashing new passwords to ensure that the correct hashing function (SHA-512) is used everywhere. This commit does not change the behavior of `passwords::check`, so it is still possible to use passwords hashed with the removed algorithms. Ref. scylladb/scylladb#24524	2025-07-15 23:28:33 +02:00
Andrzej Jackowski	62e976f9ba	auth: whitespace change in identify_best_supported_scheme() Remove tabs in `identify_best_supported_scheme()` to facilitate reuse of those lines after the for loop is removed. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:39 +02:00
Andrzej Jackowski	b20aa7b5eb	auth: require scheme as parameter for `generate_salt` This is a refactoring commit that changes the `generate_salt` function to require a password hashing scheme as a parameter. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:39 +02:00
Andrzej Jackowski	c4e6d9933d	auth: check password hashing scheme support on authenticator start This commit adds a check to the `password_authenticator` to ensure that at least one of the available password hashing schemes is supported by the current environment. It is better to fail at system startup rather than on the first attempt to use the password authenticator. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:33 +02:00
Botond Dénes	a26b6a3865	Merge 'storage: add `make_data_or_index_source` to the storages' from Ernest Zaslavsky Add `make_data_or_index_source` to the storages to utilize the new S3-based data source, which should improve restore performance. * Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. * Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage`, which just creates a `data_source` from a file, and for `s3_storage`, which creates a (maybe) decrypting source from s3 make_download_source. This change should improve performance when reading large objects from S3 and should not affect anything for `filesystem_storage`. No backport needed since it enhances functionality which has not been released yet. fixes: https://github.com/scylladb/scylladb/issues/22458 Closes scylladb/scylladb#23695 * github.com:scylladb/scylladb: sstables: Start using `make_data_or_index_source` in `sstable` sstables: refactor readers and sources to use coroutines sstables: coroutinize futurized readers sstables: add `make_data_or_index_source` to the `storage` encryption: refactor key retrieval encryption: add `encrypted_data_source` class	2025-07-15 13:32:13 +03:00
Andrei Chekun	a8fd38b92b	test.py: skip discovery when combined_test binary absent To discover which tests are included in combined_tests, pytest checks this at the very beginning. If the combined_tests binary is missing, discovery fails and the test does not run, even when it is not included in combined_tests. This PR changes the behavior so that a missing combined_tests binary no longer fails discovery; it fails only if someone tries to run a test from it. Closes scylladb/scylladb#24761	2025-07-15 09:49:02 +02:00
Ernest Zaslavsky	8d49bb8af2	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.	2025-07-15 10:10:23 +03:00
Ernest Zaslavsky	dff9a229a7	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use.	2025-07-15 10:10:23 +03:00
Pavel Emelyanov	4debe3af5d	scylla-gdb: Don't show io_queue executing and queued resources These counters are no longer accounted by io-queue code and are always zero. Even more -- accounting removal happened years ago and we don't have Scylla versions built with seastar older than that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24835	2025-07-15 07:41:20 +03:00
Botond Dénes	641a907b37	Merge 'test/alternator: clean up write isolation default and add more tests for the different modes' from Nadav Har'El In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (`only_rmw_uses_lwt`) while the manual test/alternator/run used a different write isolation mode (`always_use_lwt`). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year. But then, discussion on #24442 started: Instead of picking one mode or the other, don't we need test both modes? In fact, all four modes? The honest answer is that running all tests with all combinations of options is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had some tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters. Fixes #24442 No need to backport this patch - it's just adding more tests and changing developer-only test behavior. 
Closes scylladb/scylladb#24493 * github.com:scylladb/scylladb: test/alternator: make "run" script use only_rmw_uses_lwt test/alternator: improve tests for write isolation modes	2025-07-15 07:16:18 +03:00
Patryk Jędrzejczak	21edec1ace	test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when: - both writes succeeded with the same replica responding first, - one of the following reads succeeded with the other replica responding before it applied mutations from any of the writes. We fix the test by not expecting reads with CL=ONE to return a row. We also harden the test by inserting different rows for every pair (CL, coordinator), where one of the two coordinators is a normal node from DC1, and the other one is a zero-token node from DC2. This change makes sure that, for example, every write really inserts a row. Fixes scylladb/scylladb#22967 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#23518	2025-07-15 07:14:09 +03:00
Botond Dénes	2d3965c76e	Merge 'Reduce Alternator table name length limit to 192 and fix crash when adding stream to table with very long name' from Nadav Har'El Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways: 1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will affect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash. 2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable Streams on any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters). No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names. Fixes #24598 Closes scylladb/scylladb#24717 * github.com:scylladb/scylladb: alternator: lower maximum table name length to 192 alternator: don't crash when adding Streams to long table name alternator: split length limit for regular and auxiliary tables alternator: avoid needlessly validating table name	2025-07-15 06:57:04 +03:00
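The length arithmetic quoted in the entry above can be checked with a short sketch. The constant and function names here are hypothetical and are not taken from the Alternator source; only the numbers 222, 192, 221 and the `_scylla_cdc_log` suffix come from the commit message. The message does not say whether a name of exactly 207 characters is rejected, so the non-strict comparison below is an assumption.

```python
# Hypothetical names; only the numeric values come from the commit message.
MAX_AUX_TABLE_NAME_LEN = 222      # previous overall name length limit
NEW_TABLE_NAME_LEN_LIMIT = 192    # new limit for newly created tables
CDC_LOG_SUFFIX = "_scylla_cdc_log"
MAX_TABLE_PLUS_GSI_LEN = 221      # "the sum can be up to 221 characters"


def can_enable_streams(table_name: str) -> bool:
    """Return True if the derived CDC log table name still fits the limit.

    Assumption: the check is `<=`; the exact boundary is not stated
    in the commit message.
    """
    return len(table_name) + len(CDC_LOG_SUFFIX) <= MAX_AUX_TABLE_NAME_LEN


def max_gsi_name_len(table_name: str) -> int:
    """Longest GSI name that keeps table name + GSI name within 221 characters."""
    return MAX_TABLE_PLUS_GSI_LEN - len(table_name)


if __name__ == "__main__":
    print(MAX_AUX_TABLE_NAME_LEN - len(CDC_LOG_SUFFIX))   # threshold quoted as 207
    print(can_enable_streams("t" * NEW_TABLE_NAME_LEN_LIMIT))
    print(max_gsi_name_len("t" * NEW_TABLE_NAME_LEN_LIMIT))
```

Under these assumptions, any table created within the new 192-character limit leaves room for the CDC log suffix and for a GSI name of at least 29 characters.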
Botond Dénes	26f135a55a	Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixture not generate TLS noise + remove boost::process' from Calle Wilund Fixes #24873 If we, in KMIP host, release a connection (socket) because our connection pool for the host is full, we currently don't close the connection properly, and only rely on destructors. This just makes sure `release` closes the connection if it neither retains nor caches it. Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that looks like actual errors. Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs. This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency. Closes scylladb/scylladb#24874 * github.com:scylladb/scylladb: encryption_test: Make PyKMIP run under seastar::experimental::process test/lib: Add wrapper helper for test process fixtures kmip_host: Close connections properly if dropped by pool being full encryption_at_rest_test: Do port check using TLS	2025-07-15 06:55:34 +03:00
Botond Dénes	1f9f43d267	Merge 'kms_host: Support external temporary security credentials' from Nikos Dragazis This PR extends the KMS host to support temporary AWS security credentials provided externally via the Scylla configuration file, environment variables, or the AWS credentials file. The KMS host already supports: * Temporary credentials obtained automatically from the EC2 instance metadata service or via IAM role assumption. * Long-term credentials provided externally via configuration, environment, or the AWS credentials file. This PR is about temporary credentials that are external, i.e., not generated by Scylla. Such credentials may be issued, for example, through identity federation (e.g., Okta + gimme-aws-creds). External temporary credentials are useful for short-lived tasks like local development, debugging corrupted SSTables with `scylla-sstable`, or other local testing scenarios. These credentials are temporary and cannot be refreshed automatically, so this method is not intended for production use. Documentation has been updated to mention these additional credential sources. Fixes #22470. New feature, no backport is needed. Closes scylladb/scylladb#22465 * github.com:scylladb/scylladb: doc: Expose new `aws_session_token` option for KMS hosts kms_host: Support authn with temporary security credentials encryption_config: Mention environment in credential sources for KMS	2025-07-15 06:45:39 +03:00
Jenkins Promoter	41bc6a8e86	Update pgo profiles - x86_64	2025-07-15 04:54:17 +03:00
Jenkins Promoter	b86674a922	Update pgo profiles - aarch64	2025-07-15 04:49:45 +03:00
Nadav Har'El	a248336e66	alternator: clean up by co-routinizing Reviewers of the previous patch complained about some ugly pre-existing code in alternator/executor.cc, where returning from an asynchronous (future) function requires lengthy verbose casts. So this patch cleans up a few instances of these ugly casts by using co_return instead of return. For example, the long and verbose return make_ready_future<executor::request_return_type>( rjson::print(std::move(response))); can be changed to the shorter and more readable co_return rjson::print(std::move(response)); This patch should not have any functional implications, and also not any performance implications: I only coroutinized slow-path functions and one function that was already "partially" coroutinized (and this was especially ugly and deserved being fixed). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:35 +03:00
Nadav Har'El	13ec94107a	alternator: avoid spamming the log when failing to write response Both make_streamed() and new make_streamed_with_extra_array() functions, used when returning a long response in Alternator, would write an error- level log message if it failed to write the response. This log message is probably not helpful, and may spam the log if the application causes repeated errors intentionally or accidentally. So drop these log messages. The exception is still thrown as usual. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:34 +03:00
Nadav Har'El	d8fab2a01a	alternator: clean up and simplify request_return_type The previous patch introduced a function make_streamed_with_extra_array which was a duplicate of the existing make_streamed. Reviewers complained how baroque the new function is (just like the old function), having to jump through hoops to return a copyable function working on non-copyable objects, making strange-named copies and shared pointers of everything. We needed to return a copyable function (std::function) just because Alternator used Seastar's json::json_return_type in the return type from executor function (request_return_type). This json_return_type contained either a sstring or an std::function, but neither was ever really appropriate: 1. We want to return noncopyable_function, not an std::function! 2. We want to return an std::string (which rjson::print() returns), not an sstring! So in this patch we stop using seastar::json::json_return_type entirely in Alternator. Alternator's request_return_type is now an std::variant of three types: 1. std::string for short responses, 2. noncopyable_function for long streamed responses, 3. api_error for errors. The ugliest parts of make_streamed() where we made copies and shared pointers to allow for a copyable function are all gone. Even nicer, a lot of other ugly relics of using seastar::json_return_type are gone: 1. We no longer need obscure classes and functions like make_jsonable() and json_string() to convert strings to response bodies - an operation can simply return a string directly - usually returning rjson::print(value) or a fixed string like "" and it just works. 2. There is no more usage of seastar::json in Alternator (except one minor use of seastar::json::formatter::to_json in streams.cc that can be removed later). Alternator uses RapidJSON for its JSON needs, we don't need to use random pieces from a different JSON library. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:34 +03:00
Nadav Har'El	2385fba4b6	alternator: avoid oversized allocation in Query/Scan This patch fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. Alternator's Scan and Query operations return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1 MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e., a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. 
Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535	2025-07-14 18:41:34 +03:00
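The idea of printing a response object together with a separately held array, without first inserting the array into the object, can be illustrated in Python. This is only an analogy under stated assumptions: the real code is C++ using RapidJSON and `chunked_vector<rjson::value>`, and the function name, signature, and generator-based output below are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def print_with_extra_array_sketch(
    obj: Dict[str, Any], key: str, items: Iterable[Any]
) -> Iterator[str]:
    """Yield the JSON text of `obj` plus `key: [items...]` piece by piece.

    The items are serialized one at a time, so no single large array
    (or single large output string) has to be built up front.
    """
    head = json.dumps(obj)          # e.g. '{"Count": 3}' or '{}'
    yield head[:-1]                 # drop the closing brace
    if obj:
        yield ","                   # separator only if obj already has members
    yield json.dumps(key) + ":["
    first = True
    for item in items:
        if not first:
            yield ","
        yield json.dumps(item)
        first = False
    yield "]}"


if __name__ == "__main__":
    # 700 small items, echoing the page size used by the new test.
    items = ({"p": {"S": str(i)}} for i in range(700))
    text = "".join(print_with_extra_array_sketch({"Count": 700}, "Items", items))
    print(len(json.loads(text)["Items"]))
```

Joining the pieces is done here only to verify the result; a streaming writer would send each piece as it is produced.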
Patryk Jędrzejczak	145a38bc2e	Merge 'raft: fix voter assignment of transitioning nodes' from Emil Maskovsky Previously, nodes would become voters immediately after joining, ensuring voter status was established before bootstrap completion. With the limited voters feature, voter assignment became deferred, creating a timing gap where nodes could finish bootstrapping without becoming voters. This timing issue could lead to quorum loss scenarios, particularly observed in tests but theoretically possible in production environments. This commit reorders voter assignment to occur before the `update_topology_state()` call, ensuring nodes achieve voter status before bootstrap operations are marked complete. This prevents the problematic timing gap while maintaining compatibility with limited voters functionality. If voter assignment succeeds but topology state update fails, the operation will raise an exception and be retried by the topology coordinator, maintaining system consistency. This commit also fixes an issue where `update_nodes` ignored leaving voters, potentially exceeding the voter limit and leaving voters unaccounted for. Fixes: scylladb/scylladb#24420 No backport: Fix of a theoretical bug + CI stability improvement (we can backport eventually later if we see hits in branches) Closes scylladb/scylladb#24843 * https://github.com/scylladb/scylladb: raft: fix voter assignment of transitioning nodes raft: improve comments in group0 voter handler	2025-07-14 16:12:03 +02:00
Calle Wilund	722e2bce96	encryption_test: Make PyKMIP run under seastar::experimental::process Removes the requirement of boost::process, and all its non-seastar-ness. Hopefully also makes the IO and shutdown handling a bit more reliable.	2025-07-14 12:18:16 +00:00
Calle Wilund	253323bb64	test/lib: Add wrapper helper for test process fixtures Adds a wrapper for seastar::experimental::process, to help use external process fixtures in unit test. Mainly to share concepts such as line reading of stdout/err etc, and sync the shutdown of these. Also adds a small path searcher to find what you want to run.	2025-07-14 12:18:16 +00:00
Yaron Kaikov	fdcaa9a7e7	dist/common/scripts/scylla_sysconfig_setup: fix `SyntaxWarning: invalid escape sequence` There are invalid escape sequence warnings where raw strings should be used for the regex patterns Fixes: https://github.com/scylladb/scylladb/issues/24915 Closes scylladb/scylladb#24916	2025-07-14 11:20:41 +02:00
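The class of warning fixed in the entry above can be reproduced with a generic sketch. The patterns below are illustrative only and are not the ones used in `scylla_sysconfig_setup`; the helper name is hypothetical.

```python
import re
import warnings


def compiles_without_warning(source: str) -> bool:
    """Compile `source` with warnings promoted to errors.

    Returns False if Python complains about the source, for example
    about an invalid escape sequence such as "\\d" in a non-raw string.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            compile(source, "<example>", "exec")
        except (SyntaxError, Warning):
            return False
    return True


# Source text of two one-line programs (held in raw strings so that this
# file itself stays warning-free).
NON_RAW = r'pattern = "\d+"'     # "\d" is not a valid string escape
RAW = r'pattern = r"\d+"'        # raw string: backslash reaches `re` untouched

if __name__ == "__main__":
    print(compiles_without_warning(NON_RAW))
    print(compiles_without_warning(RAW))
    print(bool(re.fullmatch(r"\d+", "12345")))
```

Prefixing regex literals with `r` leaves the pattern's meaning unchanged while removing the warning.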
Benny Halevy	692b79bb7d	compaction: get_max_purgeable_timestamp: improve trace log messages Print the keyspace.table names, issue trace log messages also when returning early if tombstone_gc is disabled or when gc_check_only_compacting_sstables is set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24914	2025-07-14 11:16:58 +02:00
Calle Wilund	514fae8ced	kmip_host: Close connections properly if dropped by pool being full Fixes #24873 Note: this happens like never. But if we, in KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors. While not very serious, this would lead to possible TLS errors in the KMIP host used, which should be avoided if possible. Fix is simple, just make release close the connection if it neither retains nor caches it.	2025-07-14 08:31:02 +00:00
Calle Wilund	0fe8836073	encryption_at_rest_test: Do port check using TLS If we connect using just a socket, and don't terminate connection nicely, we will get annoying errors in PyKMIP log. These distract from real errors. So avoid them.	2025-07-14 08:31:02 +00:00

48527 Commits