scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-12 19:02:12 +00:00

Author	SHA1	Message	Date
Piotr Wieczorek	4f6aeb7b6b	alternator: Change the return type of rmw_operation_return Change the type from future<executor::request_return_type> to executor::request_return_type, because the method isn't async and one out of two callers unwraps the future immediately. This simplifies the code a little and probably saves a few instructions, since we suspect that moving a future<X> is more expensive than just moving X.	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	ffdc8d49c7	config: Add alternator_streams_strict_compatibility flag With this flag enabled, Alternator Streams produces more accurate event types: - nop operations (i.e. replacing an item with an identical one, deleting a nonexistent item) don't produce an event, - updates of an existing item produce a MODIFY event, instead of INSERT, - etc. This flag affects the internal behaviour of some operations, i.e. Alternator may select a preimage and propagate it to CDC (in contrary to CDC making the request), or do extra item comparisons (i.e. compare the existing item with the new one). These operations may be costly, and users that don't use Streams won't need them. This flag is live-updatable. An operation reads this flag once, and uses its value for the entire operation.	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	e3fde8087a	cdc: Don't split a row marker away from row cells CDC log table records a mutation as a sequence of log rows that record an atomic change (i.e. a row marker, tombstones, etc.), whereas a mutation in Alternator Streams always appears as a single log row. The type of operation is determined based on the type of the last log row in CDC. As a result, updates that create a row always appeared to Alternator Streams as an update (row marker + data), rather than an insert. This commit makes them a single log row. Its operation type is insert if it contains a row marker, and an update otherwise, which gives results consistent with DynamoDB Streams.	2025-10-30 07:40:31 +01:00
Nadav Har'El	aa34f0b875	alternator: fix CDC events for TTL expiration In commit `a3ec6c7d1d` we supposedly implemented the feature of telling TTL experation events from regular user-sent deletions. However, that implementation did not actually work at all... It had two bugs: 1. It created an null rjson::value() instead of an empty dictionary rjson::empty_object(), so GetRecords failed every time such a TTL expiration event was generated. 2. In the output, it used lowercase field names "type" and "principalId" instead of the uppercase "Type" and "PrincipalId". This is not the correct capitalization, and when boto3 recieves such incorrect fields it silently deletes them and never passes them to the user's get_records() call. This patch fixes those two bugs, and importantly - enables a test for this feature. We did already have such a test but it was marked as "veryslow" so doesn't run in CI and apparently not even run once to check the new feature. This test is not actually very long on Alternator when the TTL period is set very low (as we do in our tests), so I replaced the "veryslow" marker by "waits_for_expiration". The latter marker means that the test is still very slow - as much as half an hour - on DynamoDB - but runs quickly on Scylla in our test setup, and enabled in CI by default. The enabled test failed badly before this patch (a server error during GetRecords), and passes with this patch. Also, the aforementioned commit forgot to remove the paragraph in Alternator's compatibility.md that claims we don't have that feature yet. So we do it now. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26633	2025-10-29 17:08:20 +01:00
Piotr Wieczorek	2812e67f47	cdc: Emit a preimage for non-clustered tables Until this patch, CDC haven't fetched a preimage for mutations containing only a partition tombstone. Therefore, single-row deletions in a table witout a clustering key didn't include a preimage, which was inconsistent with single-row clustered deletions. This commit addresses this inconsistency. Second reason is compatibility with DynamoDB Streams, which doesn't support entire-partition deletes. Alternator uses partition tombstones for single-row deletions, though, and in these cases the 'OldImage' was missing from REMOVE records. Fixes https://github.com/scylladb/scylladb/issues/26382 Closes scylladb/scylladb#26578	2025-10-29 17:54:58 +02:00
Nadav Har'El	29ed1f3de7	Merge 'cql3: Refactor vector search select impl into a dedicated class' from Karol Nowacki cql3: Refactor vector search select impl into a dedicated class The motivation for this change is crash fixed in https://github.com/scylladb/scylladb/pull/25500. This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization. Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes. To address this, the refactoring introduces the following changes: - A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk. - The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes. - An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future. Fixes: VECTOR-162 No backport is needed, as this is refactoring. Closes scylladb/scylladb#25798 * github.com:scylladb/scylladb: cql3: Rename indexed_table_select_statement cql3: Move vector search select to dedicated class	2025-10-29 17:49:24 +02:00
Piotr Dulikowski	aba922ea65	Merge 'cdc: improve cdc metadata loading' from Michael Litvak when loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. the CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. on tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and append them to the existing map. we can do this because we only append new entries to cdc_streams_history with higher timestamp than all previous entries. This makes this reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablets splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes https://github.com/scylladb/scylladb/issues/26732 backport to 2025.4 where cdc with tablets is introduced Closes scylladb/scylladb#26160 * github.com:scylladb/scylladb: test: cdc: extend cdc with tablets tests cdc: improve cdc metadata loading	2025-10-29 11:07:48 +01:00
Michał Hudobski	46589bc64c	secondary_index: disallow multiple vector indexes on the same column We currently allow creating multiple vector indexes on one column. This doesn't make much sense as we do not support picking one when making ann queries. To make this less confusing and to make our behavior similar to Cassandra we disallow the creation of multiple vector indexes on one column. We also add a test that checks this behavior. Fixes: VECTOR-254 Fixes: #26672 Closes scylladb/scylladb#26508	2025-10-29 11:55:38 +02:00
Patryk Jędrzejczak	7304afb75a	Merge 'vnodes cleanup: renames and code comments fixes' from Petr Gusev This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR. backport: not needed, this PR contains only renames and code comment fixes Closes scylladb/scylladb#26745 * https://github.com/scylladb/scylladb: test_automatic_cleanup: fix comment storage_proxy: remove stale comment storage_proxy: improve run_fenceable_write comment topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup	2025-10-29 10:38:27 +01:00
Dawid Mędrek	48cbf6b37a	test/cluster/test_tablets: Migrate dtest We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node` from dtests to the repository of Scylla. We divide the test into two steps to make testing easier and even possible with RF-rack-valid keyspaces being enforced. Closes scylladb/scylladb#26285	2025-10-29 11:09:48 +02:00
Karol Nowacki	9f1fd7f5a0	cql3: Rename indexed_table_select_statement To align with `vector_indexed_table_select_statement`, this commit renames `indexed_table_select_statement` to `view_indexed_table_select_statement` to clarify its usage with materialized views.	2025-10-29 08:37:25 +01:00
Karol Nowacki	357c0a8218	cql3: Move vector search select to dedicated class The execution of SELECT statements with ANN ordering (vector search) was previously implemented within `indexed_table_select_statement`. This was not ideal, as vector search logic is independent of secondary index selects. This resulted in unnecessary complexity because vector search queries don't use features like aggregates or paging. More importantly, `indexed_table_select_statement` assumed a non-null `view_schema` pointer, which doesn't hold for vector indexes (where `view_ptr` is null). This caused null pointer dereferences during ANN ordered selects, leading to crashes (VECTOR-179). Other parts of the class still dereference `view_schema` without null checks. Moving the vector search select logic out of `indexed_table_select_statement` simplifies the code and prevents these null pointer dereferences.	2025-10-29 08:37:21 +01:00
Petr Gusev	b6bcd062de	test_automatic_cleanup: fix comment	2025-10-28 17:55:20 +01:00
Petr Gusev	d49be677d5	storage_proxy: remove stale comment	2025-10-28 17:55:20 +01:00
Petr Gusev	c60223f009	storage_proxy: improve run_fenceable_write comment	2025-10-28 17:55:20 +01:00
Petr Gusev	58d100a0cb	topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes	2025-10-28 17:55:20 +01:00
Petr Gusev	fa9dc71f30	storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed	2025-10-28 17:55:19 +01:00
Dawid Mędrek	5e03b01107	test/cluster: Add test_simulate_upgrade_legacy_to_raft_listener_registration We provide a reproducer test of the bug described in scylladb/scylladb#18049. It should fail before the fix introduced in scylladb/scylladb@7ea6e1ec0a, and it should succeed after it. Refs scylladb/scylladb#18049 Fixes scylladb/scylladb#18071 Closes scylladb/scylladb#26621	2025-10-28 17:32:15 +01:00
Yauheni Khatsianevich	99dc31e71a	tests(lwt): new test for LWT testing during tablet resize – Workload: N workers perform CAS updates UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read. - Enable balancing and increase min_tablet_count to force split, flush and lower min_tablet_count to merge. - “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a LOCAL_SERIAL read to determine whether the CAS actually applied. - Invariants: after the run, for every pk and column s{i}, the stored value equals the number of confirmed CAS by worker i (no lost or phantom updates) despite ongoing tablet moves. Closes scylladb/scylladb#26113	2025-10-28 16:48:57 +01:00
Petr Gusev	d300adc10c	storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup This cleanup is only for vnodes-based tables, reflect this in the function name.	2025-10-28 15:37:28 +01:00
Michael Litvak	4cc0a80b79	test: cdc: extend cdc with tablets tests extend and improve the tests of virtual tables for cdc with tablets. split the existing virtual tables test to one test that validates the virtual tables against the internal cdc tables, and triggering some tablet splits in order to create entries in the cdc_streams_history table, and add another test with basic validation of the virtual tables when there are multiple cdc tables.	2025-10-28 15:06:21 +01:00
Pavel Emelyanov	ae0136792b	utils: Make directory_lister use generator lister from seastar The directory_lister uses utils::lister under the hood which accepts a callback to put directory_entry-s in. The directory_lister's callback then puts the entries into a queue and its .get() method pops up entries from there to return to caller. This patch simplifies this code by switching the directory_lister to use experimental generator lister from seastar. With it, the entries to be returned from .get() are simply co_await-ed from calling the generator object (wich co_yield-s them). As a result the directory_lister becomes smaller and drops the need for utils::lister. Since directory_lister was created as a replacement for that callback-based lister, the latter can be eventually removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26586	2025-10-28 15:20:20 +02:00
Pavel Emelyanov	948cefa5f9	test: Extend API consistency test with tokens_endpoint endpoint Recently (#26231) there was added a test to check that several API endpoints, that return tokens and corresponding replica nodes, are consistent with tablet map. This patch adds one more API endpoint to the validation -- the /storage_service/tokens_endpoint one. The extention is pretty straightforward, but the new endpoint returns back a single (primary) replica for a token, so the test check is slightly modified to account for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26580	2025-10-28 15:18:09 +02:00
Nadav Har'El	87573197d4	test/alternator: reproducers for missing headers and request limit This patch adds reproducing tests in test/alternator for issue #23438, which is about missing checks for the length of headers and the URL in Alternator requests. These should be limited, because Seastar's HTTP server, which Scylla uses, reads them into memory so they can OOM Scylla. The tests demonstrate that DynamoDB enforces a 16 KB limit on the headers and the URL of the request, but Scylla doesn't (a code inspection suggests it does not in fact have any limit). The two tests pass on DynamoDB and currently xfail on Alternator. Refs #23438. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23442	2025-10-28 15:12:25 +02:00
Pavel Emelyanov	d9bfbeda9a	lister: Fix race between readdir and stat Sometimes file::list_directory() returns entries without type set. In thase case lister calls file_type() on the entry name to get it. In case the call returns disengated type, the code assumes that some error occurred and resolves into exception. That's not correct. The file_type() method returns disengated type only if the file being inspected is missing (i.e. on ENOENT errno). But this can validly happen if a file is removed bettween readdir and stat. In that case it's not "some error happened", but a enry should be just skipped. In "some error happened", then file_type() would resolve into exceptional future on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26595	2025-10-28 15:10:22 +02:00
Botond Dénes	ac618a53f4	Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk Currently, batchlog replay is considered successful even if all batches fail to be sent (they are replayed later). However, repair requires all batches to be sent successfully. Currently, if batchlog isn't cleared, the repair never learns and updates the repair_time. If GC mode is set to "repair", this means that the tombstones written before the repair_time (minus propagation_delay) can be GC'd while not all batches were replied. Consider a scenario: - Table t has a row with (pk=1, v=0); - There is an entry in the batchlog that sets (pk=1, v=1) in table t; - The row with pk=1 is deleted from table t; - Table t is repaired: - batchlog reply fails; - repair_time is updated; - propagation_delay seconds passes and the tombstone of pk=1 is GC'd; - batchlog is replayed and (pk=1, v=1) inserted - data resurrection! Do not update repair_time if sending any batch fails. The data is still repaired. For tablet repair the repair runs, but at the end the exception is passed to topology coordinator. Thanks to that the repair_time isn't updated. The repair request isn't removed as well, due to which the repair will need to rerun. Apart from that, a batch is removed from the batchlog if its version is invalid or unknown. The condition on which we consider a batch too fresh to replay is updated to consider propagation_delay. Fixes: https://github.com/scylladb/scylladb/issues/24415 Data resurrection fix; needs backport to all versions Closes scylladb/scylladb#26319 * github.com:scylladb/scylladb: db: fix indentation test: add reproducer for data resurrection repair: fail tablet repair if any batch wasn't sent successfully db/batchlog_manager: fix making decision to skip batch replay db: repair: throw if replay fails db/batchlog_manager: delete batch with incorrect or unknown version db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-28 14:52:59 +02:00
Botond Dénes	f3cec5f11a	Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek Before this commit, when the underlying materialized view was created, it didn't have the property `tombstone_gc` set to any value. We fix the bug in this PR. Implementation strategy: 1. Move code responsible for producing the schema of a secondary index to the file that handles `CREATE INDEX`. 2. Set the property when creating the view. 3. Add reproducer tests. Fixes scylladb/scylladb#26542 Backport: we can discuss it. Closes scylladb/scylladb#26543 * github.com:scylladb/scylladb: index: Set tombstone_gc when creating secondary index index: Make `create_view_for_index` method of `create_index_statement` index: Move code for creating MV of secondary index to cql3 db, cql3: Move creation of underlying MV for index	2025-10-28 14:42:42 +02:00
Nadav Har'El	c3593462a4	alternator: improve protection against oversized requests Following DynamoDB, Alternator also places a 16 MB limit on the size of a request. Such a limit is necessary to avoid running out of memory - because the AWS message authentication protocol requires reading the entire request into memory before its signature can be verified. Our implementation for this limit used Seastar's HTTP server's content_length_limit feature. However, this Seastar feature is incomplete - it only works when the request uses the Content-Length header, and doesn't do anything if the request doesn't have a Content-Length (it may use chunked encoding, or have no length at all). So malicious users can cause Scylla to OOM by sending a huge request without a Content-Length. So in this patch we stop using the incomplete Seastar feature, and implement the length limit in Scylla in a way that works correctly with or without Content-Length: We read from the input stream and if we go over 16MB, we generate an error. Because we dropped Seastar's protection against a long Content-Length, we also need to fix a piece of code which used Content-Length to reserve some semaphore units to prevent reading many large requests in parallel. We fix two problems in the code: 1. If Content-Length is over the limit, we shouldn't attempt to reserve semaphore units - this should just be a Payload Too Large error. 2. If Content-Length is missing, the existing code did nothing and had a TODO that we should. In this patch we implement what was suggested in that TODO: We temporarily reserve the whole 16 MB limit, and after reading the actual request, we return part of the reservation according to the real request size. That last fix is important, because typically the largest requests will be BatchWriteItem where a well-written client would want to use chunked encoding, not Content-Length, to avoid materializing the entire request up-front. For such clients, the memory use semaphore did nothing, and now it does the right thing. Note that this patch does not solve the problem #12166 that existed with Seastar's length-limiting implementation but still exists in the new in-Scylla length-limiting implementation: The fact we send an error response in the middle of the request and then close the connection, while the client continues to send the request, can lead to an RST being sent by the server kernel. Usually this will be fine - well-written client libraries will be able to read the response before the RST. But even with a well-written library in some rare timings the client may get the RST before the response, and will miss the response, and get an empty or partial response or "connection reset by peer". This issue existed before this patch, and still exists, but is probably of minor impact. Fixes #8196 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23434	2025-10-28 15:24:46 +03:00
Lakshmi Narayanan Sreethar	64c1ec99e0	cmake: link crypto lib to utils The utils library requires OpenSSL's libcrypto for cryptographic operations and without linking libcrypto, builds fail with undefined symbol errors. Fix that by linking `crypto` to `utils` library when compiled with cmake. The build files generated with configure.py already have `crypto` lib linked, so they do not have this issue. Fix #26705 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26707	2025-10-28 14:11:03 +02:00
Anna Stuchlik	6fa342fb18	doc: add OS support for version 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26450 Closes scylladb/scylladb#26616	2025-10-28 13:29:40 +03:00
Radosław Cybulski	ea6b22f461	Add max trace size output configuration variable In #24031 users complained, that trace message is truncated, namely it's no longer json parsable and table name might not be part of the output. This path enables users to configure maximum size of trace message. In case user wanted `table` name, but didn't care about message size, #26634 will help. - add configuration varable `alternator_max_users_query_size_in_trace_output` with default value of 4096 (4 times old default value). - modify `truncated_content_view` function to use new configuration variable for truncation limit - update `truncated_content_view` to consistently truncate at given size, previously trunctation would also happen when data arrived in more than one chunk - update `truncated_content_view` to better handle truncated value (limit number of copies) - fix `scylla_config_read` call - call to `query` for a configuration name that is not existing will return `Items` array empty (but present) - this would raise array access exception few lines below. - add test Refs #26634 Refs #24031 Closes scylladb/scylladb#26618	2025-10-28 13:29:15 +03:00
Pavel Emelyanov	ac1d709709	Merge 'tablet_sstable_streamer: defer SSTable unlinking until fully streamed' from Taras Veretilnyk When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files. test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed. Fixes #26606 Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444. Closes scylladb/scylladb#26622 * github.com:scylladb/scylladb: test_tablets2: verify SSTable cleanup after tablet load and stream tablet_sstable_streamer: replace unlink() call with mark_for_deletion()	2025-10-28 13:25:40 +03:00
Patryk Jędrzejczak	5321720853	test: test_raft_recovery_stuck: reconnect driver after rolling restarts It turns out that #21477 wasn't sufficient to fix the issue. The driver may still decide to reconnect the connection after `rolling_restart` returns. One possible explanation is that the driver sometimes handles the DOWN notification after all nodes consider each other UP. Reconnecting the driver after restarting nodes seems to be a reliable workaround that many tests use. We also use it here. Fixes #19959 Closes scylladb/scylladb#26638	2025-10-28 13:24:23 +03:00
Anna Stuchlik	bd5b966208	doc: add --list-active-releases to Web Installer Fixes https://github.com/scylladb/scylladb/issues/26688 V2 of https://github.com/scylladb/scylladb/pull/26687 Closes scylladb/scylladb#26689	2025-10-28 13:21:57 +03:00
Pavel Emelyanov	54a117b19d	Merge 'retry_strategy: Switch to using seastar's retry_strategy (take two)' from Ernest Zaslavsky With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB. Despite this update is purely a refactoring effort and does not introduce functional changes it should be ported back to 2025.3 and 2025.4 otherwise it will make future backports of bugfixes/improvements related to `s3_client` near to impossible ref: https://github.com/scylladb/seastar/issues/2803 depends on: https://github.com/scylladb/seastar/pull/2960 Closes scylladb/scylladb#25801 * github.com:scylladb/scylladb: s3_client: remove unnecessary `co_await` in `make_request` s3 cleanup: remove obsolete retry-related classes s3_client: remove unused `filler_exception` s3_client: fix indentation s3_client: simplify chunked download error handling using `make_request` s3_client: reformat `make_request` functions for readability s3_client: eliminate duplication in `make_request` by using overload s3_client: reformat `make_request` function declarations for readability s3_client: reorder `make_request` and helper declarations s3_client: add `make_request` override with custom retry and error handler s3_client: migrate s3_client to Seastar HTTP client s3_client: fix crash in `copy_s3_object` due to dangling stream s3_client: coroutinize `copy_s3_object` response callback aws_error: handle missing `unexpected_status_error` case s3_creds: use Seastar HTTP client with retry strategy retry_strategy: add exponential backoff to `default_aws_retry_strategy` retry_strategy: introduce Seastar-based retry strategy retry_strategy: update CMake and configure.py for new strategy retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy` retry_strategy: fix include retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc	2025-10-28 13:08:42 +03:00
Patryk Jędrzejczak	820c8e7bc4	Merge 'LWT: use shards_ready_for_reads for replica locks' from Petr Gusev When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard. The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395). Fixes scylladb/scylladb#26727 backport: need to backport to 2025.4 Closes scylladb/scylladb#26719 * https://github.com/scylladb/scylladb: paxos_state: use shards_ready_for_reads paxos_state: inline shards_for_writes into get_replica_lock	2025-10-28 10:37:53 +01:00
Avi Kivity	d81796cae3	Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and releasing it when it finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. Fixes https://github.com/scylladb/scylladb/issues/25341 Closes scylladb/scylladb#25456 * github.com:scylladb/scylladb: mv: limit concurrent view updates from all sources database: rename _view_update_concurrency_sem to _view_update_memory_sem	2025-10-28 11:13:24 +02:00
Aleksandra Martyniuk	910cd0918b	locator: use get_primary_replica for get_primary_endpoints Currently, tablet_sstable_streamer::get_primary_endpoints is out of sync with tablet_map::get_primary_replica. The get_primary_replica optimizes the choice of the replica so that the work is fairly distributes among nodes. Meanwhile, get_primary_endpoints always chooses the first replica. Use get_primary_replica for get_primary_endpoints. Fixes: https://github.com/scylladb/scylladb/issues/21883. Closes scylladb/scylladb#26385	2025-10-28 09:56:08 +02:00
Michael Litvak	8743422241	cdc: improve cdc metadata loading when loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. the CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. on tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and append them to the existing map. we can do this because we only append new entries to cdc_streams_history with higher timestamp than all previous entries. This makes this reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablets splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes scylladb/scylladb#26732	2025-10-28 08:54:09 +01:00
Wojciech Mitros	f07a86d16e	mv: limit concurrent view updates from all sources Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and releasing it when it finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. The effect of this patch can also be observed when writing to a base table with a large number of materialized views, like in the materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent dtest. In that test, if we perform a full scan in parallel to a write workload with a concurrency of 100 to a table with 100 views, the scan would sometimes timeout because it would effectively get 1/10000 of cpu. With this patch, the cpu concurrency of view updates was limited to 128 (we ran both writes and scan in the same service level), and the scan no longer timed out. Fixes https://github.com/scylladb/scylladb/issues/25341	2025-10-27 18:55:41 +01:00
Pavel Emelyanov	81f598225e	error_injection: Add template parameter default for in release mode The std::optional<T> inject_parameter(...) method is a template, and in dev/debug modes this parameter is defaulted to std::string_view, but for release mode it's not. This patch makes it symmetrical. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26706	2025-10-27 16:39:22 +01:00
Taras Veretilnyk	1361ae7a0a	test_tablets2: verify SSTable cleanup after tablet load and stream Modify existing test_tablet_load_and_stream testcase to verify that SSTable files are properly deleted from the upload directory after streaming.	2025-10-27 16:36:08 +01:00
Taras Veretilnyk	517a4dc4df	tablet_sstable_streamer: replace unlink() call with mark_for_deletion() When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patches replaces unlink() call with mark_for_deletion()	2025-10-27 16:30:05 +01:00
Petr Gusev	478f7f545a	paxos_state: use shards_ready_for_reads Acquiring locks on both shards for the entire tablet migration period is redundant. In most cases, locking only the old shard or only the new shard is sufficient. Using shards_ready_for_reads reduces the situations in which we need to lock both shards to: * intra-node migrations only * only during the write_both_read_new state Once the global barrier completes in the write_both_read_new state, no requests remain on the old shard, so we can safely acquire locks only on the new shard. Fixes scylladb/scylladb#26727	2025-10-27 16:22:28 +01:00
Piotr Dulikowski	fd966ec10d	Merge 'cdc: garbage collect CDC streams for tablets' from Michael Litvak introduce helper functions that can be used for garbage collecting old cdc streams for tablets-based keyspaces. add a background fiber to the topology coordinator that runs periodically and checks for old CDC streams for tablets keyspaces that can be garbage collected. the garbage collection works by finding the newest cdc timestamp that has been closed for more than the configured cdc TTL, and removing all information from the cdc internal tables about cdc timestamps and streams up to this timestamp. in general it should be safe to remove information about these streams because they are closed for more than TTL, therefore all rows that were written to these streams with the configured TTL should be dead. the exception is if the TTL is altered to a smaller value, and then we may remove information about streams that still have live rows that were written with the longer ttl. Fixes https://github.com/scylladb/scylladb/issues/26669 Closes scylladb/scylladb#26410 * github.com:scylladb/scylladb: cdc: garbage collect CDC streams periodically cdc: helpers for garbage collecting old streams for tablets	2025-10-27 16:16:55 +01:00
Michał Hudobski	541b52cdbf	cql: fail with a better error when null vector is passed to ann query Currently when a null vector is passed to an ANN query we fail with a quite confusing error ("NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="to_bytes() called on raw value that is null">})"). This patch fixes that by throwing an InvalidRequestException with an appropriate message instead. We also add a test case that validates this behavior. Fixes: VECTOR-257 Closes scylladb/scylladb#26510	2025-10-27 16:09:08 +02:00
Botond Dénes	417270b726	Merge 'Port dtest EAR tests to test.py/pytest in scylla CI' from Calle Wilund Fixes #26641 * Adds shared abstraction for dockerized mock services for out pytests (not using python docker, due to both library and podman) * Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing * Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest. * Shared KMIP mock between boost test and pytest and speeds up boost test shutdown. When merged, the dtest counterpart can be decommissioned. Closes scylladb/scylladb#26642 * github.com:scylladb/scylladb: test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper test::cluster::test_encryption: Port dtest EAR tests test::cluster::conftest: Add key_provider fixture test::pylib::encryption_provider: Port dtest encryption provider classes test::pylib::dockerized_service: Add helper for running docker/podman test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures test::boost::kmip_wrapper: Move python script for PyKMIP to pylib	2025-10-27 15:42:52 +02:00
Patryk Jędrzejczak	e1c3f666c9	Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev Problems addressed by this PR * Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up. * Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations. * `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details). * Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`: * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited. * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between. * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them. * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary. Fixes scylladb/scylladb#26150 backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting Closes scylladb/scylladb#26315 * https://github.com/scylladb/scylladb: test_automatic_cleanup: add test_cleanup_waits_for_stale_writes test_fencing: fix due to new version increment test_automatic_cleanup: clean it up storage_proxy: wait for closing sessions in sstable cleanup fiber storage_proxy: rename await_pending_writes -> await_stale_pending_writes storage_proxy: use run_fenceable_write storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check storage_proxy: introduce run_fenceable_write storage_proxy: move update_fence_version from shared_token_metadata storage_proxy: fix start_write() operation scope in apply_locally storage_proxy: move post fence check into handle_write storage_proxy: move fencing into mutate_counter_on_leader_and_replicate storage_proxy::handle_read: add fence check before get_schema storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber sstable_cleanup_fiber: use coroutine::parallel_for_each storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock topology_coordinator: barrier before cleanup topology_coordinator: small start_cleanup refactoring global_token_metadata_barrier: add fenced flag	2025-10-27 12:35:13 +01:00
Petr Gusev	5ab2db9613	paxos_state: inline shards_for_writes into get_replica_lock No need to have two functions since both callers of get_replica_lock() use shards_for_writes() to compute the shards where the locks must be acquired. Also while at it, inline the acquire() lambda in get_replica_lock() and replace it with a loop over shards. This makes the code more strightforward.	2025-10-27 11:12:29 +01:00
Michael Litvak	6109cb66be	cdc: garbage collect CDC streams periodically add a background fiber to the topology coordinator that runs periodically and checks for old CDC streams for tablets keyspaces that can be garbage collected.	2025-10-26 11:01:20 +01:00

1 2 3 4 5 ...

50289 Commits