scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-03 06:35:51 +00:00

Author	SHA1	Message	Date
Dario Mirovic	028de964c8	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567 Fixes: #25273 (cherry picked from commit `9f4344a435`)	2025-07-30 21:57:15 +02:00
Dario Mirovic	96f5bcc5be	test/cqlpy: add protocol_exception tests Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter from Scylla's metrics endpoint. These metrics are used to track protocol error count before and after each test. Add cql_with_protocol context manager utility for session creation with parameterized protocol_version value. This is used for testing connection establishment with different protocol versions, and proper disposal of successfully established sessions. The tests cover two failure scenarios: - Protocol version mismatch in test_protocol_version_mismatch which tests both supported and unsupported protocol version - Malformed frames via raw socket in _protocol_error_impl, used by several test functions, and also test_no_protocol_exceptions test to assert that the error counters never decrease during test execution, catching unintended metric resets Refs: #24567 Fixes: #25273 (cherry picked from commit `7aaeed012e`)	2025-07-30 21:56:45 +02:00
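The counter helper described in `96f5bcc5be` can be sketched as follows. This is a minimal illustration that parses Prometheus-style text only; the function name, the `shard` label and the sample values are assumptions, not taken from the actual test code.

```python
METRIC = "scylla_transport_cql_errors_total"


def protocol_error_total(metrics_text: str) -> int:
    """Sum every protocol_error sample of the CQL error counter.

    Assumes the usual text exposition layout: name{labels} value [timestamp].
    """
    total = 0
    for line in metrics_text.splitlines():
        if not line.startswith(METRIC + "{"):
            continue  # skips comments and unrelated metrics
        labels, _, rest = line.rpartition("} ")
        if 'type="protocol_error"' in labels:
            total += int(float(rest.split()[0]))
    return total


# Hypothetical sample; a real endpoint may expose different labels.
SAMPLE = """\
# TYPE scylla_transport_cql_errors_total counter
scylla_transport_cql_errors_total{shard="0",type="protocol_error"} 3
scylla_transport_cql_errors_total{shard="1",type="protocol_error"} 2
scylla_transport_cql_errors_total{shard="0",type="unavailable"} 7
"""
```

A test would read the total before and after sending a malformed frame and compare the two values.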
Aleksandra Martyniuk	3f93cdc61b	nodetool: repair: skip tablet keyspaces Currently, the nodetool repair command repairs both vnode and tablet keyspaces if no keyspace is specified. We should use this command to repair only vnode keyspaces, but this isn't easily accessible - we have to explicitly run repair only on vnode keyspaces. With this change, nodetool repair skips tablet keyspaces unless a tablet keyspace is explicitly passed as an argument. Fixes: #24040. Closes scylladb/scylladb#24042 (cherry picked from commit `6f8b378e80`) Closes scylladb/scylladb#25152	2025-07-24 16:32:37 +03:00
Michael Litvak	a57c51b9d7	tablets: stop storage group on deallocation When a tablet transitions to a post-cleanup stage on the leaving replica we deallocate its storage group. Before the storage can be deallocated and destroyed, we must make sure it's cleaned up and stopped properly. Normally this happens during the tablet cleanup stage, when table::cleanup_table is called, so by the time we transition to the next stage the storage group is already stopped. However, it's possible that tablet cleanup did not run in some scenario: 1. The topology coordinator runs tablet cleanup on the leaving replica. 2. The leaving replica is restarted. 3. When the leaving replica starts, still in `cleanup` stage, it allocates a storage group for the tablet. 4. The topology coordinator moves to the next stage. 5. The leaving replica deallocates the storage group, but it was not stopped. To address this scenario, we always stop the storage group when deallocating it. Usually it will be already stopped and complete immediately, and otherwise it will be stopped in the background. Fixes scylladb/scylladb#24857 Fixes scylladb/scylladb#24828 Closes scylladb/scylladb#24896 (cherry picked from commit `fa24fd7cc3`) Closes scylladb/scylladb#24906	2025-07-22 11:02:28 +03:00
Pavel Emelyanov	57b24383ed	Merge '[Backport 2025.1] Improve background disposal of tablet_metadata' from Scylladb[bot] As seen in #23284, when the tablet_metadata contains many tables, even empty ones, we're seeing a long queue of seastar tasks coming from the individual destruction of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`. This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects on their owner shard by sorting them into vectors, per owner shard. Also, a background call to clear_gently was added to `~token_metadata`, as it is destroyed arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the contained tablet_metadata would be cleared gently. Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom and verify that it is gone with this change. Fixes #24814 Refs #23284 This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables. 
* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards - (cherry picked from commit `3acca0aa63`) - (cherry picked from commit `493a2303da`) - (cherry picked from commit `e0a19b981a`) - (cherry picked from commit `2b2cfaba6e`) - (cherry picked from commit `2c0bafb934`) - (cherry picked from commit `4a3d14a031`) - (cherry picked from commit `6e4803a750`) Parent PR: #24618 Closes scylladb/scylladb#24862 * github.com:scylladb/scylladb: token_metadata_impl: clear_gently: release version tracker early test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables token_metadata: clear_and_destroy_impl when destroyed token_metadata: keep a reference to shared_token_metadata token_metadata: move make_token_metadata_ptr into shared_token_metadata class replica: database: get and expose a mutable locator::shared_token_metadata locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction	2025-07-22 11:02:00 +03:00
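The per-owner-shard sorting mentioned in `57b24383ed` can be illustrated with a small sketch. It only models the grouping step, assuming each pointer is represented by a hypothetical `(owner_shard, table_id)` pair; it says nothing about how the real code schedules the destruction.

```python
from collections import defaultdict


def group_by_owner_shard(tablet_maps):
    """Bucket (owner_shard, table_id) pairs so each shard gets one batch."""
    per_shard = defaultdict(list)
    for owner_shard, table_id in tablet_maps:
        per_shard[owner_shard].append(table_id)
    return dict(per_shard)


# Three tablet maps owned by two shards: two batches instead of three tasks.
batches = group_by_owner_shard([(0, "t1"), (1, "t2"), (0, "t3")])
```

With many tables, each shard then receives a single batch to destroy rather than one cross-shard task per object.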
Patryk Jędrzejczak	f11413274d	test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when: - both writes succeeded with the same replica responding first, - one of the following reads succeeded with the other replica responding before it applied mutations from any of the writes. We fix the test by not expecting reads with CL=ONE to return a row. We also harden the test by inserting different rows for every pair (CL, coordinator), where one of the two coordinators is a normal node from DC1, and the other one is a zero-token node from DC2. This change makes sure that, for example, every write really inserts a row. Fixes scylladb/scylladb#22967 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#23518 (cherry picked from commit `21edec1ace`) Fixing conflicts required additionally backporting the log line from scylladb/scylladb#22968. Closes scylladb/scylladb#24983	2025-07-22 11:01:18 +03:00
Pavel Emelyanov	fb20a59242	Merge '[Backport 2025.1] cdc: throw error if column doesn't exist' from Scylladb[bot] In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on a null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently, which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates one of these behaviors by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes https://github.com/scylladb/scylladb/issues/24952 backport needed - simple fix for a node crash - (cherry picked from commit `b336f282ae`) - (cherry picked from commit `86dfa6324f`) Parent PR: #24986 Closes scylladb/scylladb#25065 * github.com:scylladb/scylladb: test: cdc: add test_cdc_with_alter cdc: throw error if column doesn't exist	2025-07-22 11:00:32 +03:00
Dawid Mędrek	739dbf0f7d	cdc: Forbid altering columns of inactive CDC log table When CDC becomes disabled on the base table, the CDC log table still exists (cf. scylladb/scylladb@adda43edc7). If it continues to exist up to the point when CDC is re-enabled on the base table, no new log table will be created -- instead, the old log table will be re-attached. Since we want to avoid situations when the definition of the log table has become misaligned with the definition of the base table due to actions of the user, we forbid modifying the set of columns or renaming them in CDC log tables, even when they're inactive. Validation tests are provided. (cherry picked from commit `59800b1d66`)	2025-07-21 11:42:38 +00:00
Dawid Mędrek	02108573b6	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 (cherry picked from commit `20d0050f4e`)	2025-07-21 11:42:38 +00:00
Benny Halevy	7414cf1327	test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables Reproduces #23284 Currently skipped in release mode since it requires the `short_tablet_stats_refresh_interval` interval. Ref #24641 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `4a3d14a031`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-21 10:23:31 +03:00
Benny Halevy	d70ed984cb	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. Since it's a reference-counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2c0bafb934`)	2025-07-21 10:02:19 +03:00
Benny Halevy	e24bfb0bf3	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2b2cfaba6e`)	2025-07-21 10:00:49 +03:00
Benny Halevy	aa8e65616e	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-21 10:00:49 +03:00
Michael Litvak	e54414ab28	test: cdc: add test_cdc_with_alter Add a test that tests adding and dropping a column to a table with CDC enabled while writing to it. (cherry picked from commit `86dfa6324f`)	2025-07-20 10:20:24 +02:00
Avi Kivity	36d2f80f38	Merge '[Backport 2025.1] storage_service: Use utils::chunked_vector to avoid big allocation' from Scylladb[bot] The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, 
std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 - (cherry picked from commit `c5a136c3b5`) Parent PR: #24561 Closes scylladb/scylladb#24890 * github.com:scylladb/scylladb: storage_service: Use utils::chunked_vector to avoid big allocation utils: chunked_vector: implement erase() for single elements and ranges utils: chunked_vector: implement insert() for single-element inserts	2025-07-16 17:28:36 +03:00
Piotr Dulikowski	451cf275bf	Merge '[Backport 2025.1] Fix for cassandra role gets recreated after DROP ROLE' from Scylladb[bot] This patchset fixes regression introduced by `7e749cd848` when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user. Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm it with CL QUORUM and only then atomically create role or password. If server is started without cluster quorum we'll skip creating role or password. Fixes https://github.com/scylladb/scylladb/issues/24469 Backport: all versions since 2024.2 - (cherry picked from commit `68fc4c6d61`) - (cherry picked from commit `c96c5bfef5`) - (cherry picked from commit `2e2ba84e94`) - (cherry picked from commit `f85d73d405`) - (cherry picked from commit `d9ec746c6d`) - (cherry picked from commit `a3bb679f49`) - (cherry picked from commit `67a4bfc152`) - (cherry picked from commit `0ffddce636`) - (cherry picked from commit `5e7ac34822`) Parent PR: #24451 Closes scylladb/scylladb#24693 * github.com:scylladb/scylladb: test: auth_cluster: add test for password reset procedure auth: cache roles table scan during startup test: auth_cluster: add test for replacing default superuser test: pylib: add ability to specify default authenticator during server_start test: pylib: allow rolling restart without waiting for cql auth: split auth-v2 logic for adding default superuser password auth: split auth-v2 logic for adding default superuser role auth: ldap: fix waiting for underlying role manager auth: wait for default role creation before starting authorizer and authenticator	2025-07-16 08:08:49 +02:00
Avi Kivity	60135c8d6c	utils: chunked_vector: implement erase() for single elements and ranges Implement using std::rotate() and resize(). The elements to be erased are rotated to the end, then resized out of existence. Again we defer optimization for trivially copyable types. Unit tests are added. Needed for range_streamer with token_ranges using chunked_vector. (cherry picked from commit `d6eefce145`)	2025-07-16 07:38:55 +08:00
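The rotate-then-resize approach described in `60135c8d6c` can be sketched on a plain Python list. This is an illustration of the technique only; `rotate_left` and `erase_range` are hypothetical names that mimic `std::rotate()` and `resize()`, not the `chunked_vector` code.

```python
def rotate_left(items, first, middle, last):
    """Mimic std::rotate on items[first:last]: items[middle] moves to first."""
    items[first:last] = items[middle:last] + items[first:middle]


def erase_range(items, first, last):
    """Erase items[first:last] by rotating them to the end, then shrinking."""
    count = last - first
    rotate_left(items, first, last, len(items))
    if count:
        del items[len(items) - count:]  # the "resize" step


data = list(range(8))
erase_range(data, 2, 5)  # drops 2, 3 and 4
```

Erasing a single element is the same call with `last == first + 1`.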
Avi Kivity	efad391fac	utils: chunked_vector: implement insert() for single-element inserts partition_range_compat's unwrap() needs insert if we are to use it for chunked_vector (which we do). Implement using push_back() and std::rotate(). emplace(iterator, args) is also implemented, though the benefit is diluted (it will be moved after construction). The implementation isn't optimal - if T is trivially copyable then using std::memmove() will be much faster than std::rotate(), but this complex optimization is left for later. Unit tests are added. (cherry picked from commit `5301f3d0b5`)	2025-07-16 07:38:55 +08:00
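The push_back-plus-rotate insertion described in `efad391fac` can be sketched the same way. `insert_at` is a hypothetical name, and the list slice stands in for `std::rotate(pos, end - 1, end)`.

```python
def insert_at(items, pos, value):
    """Insert value before index pos using append followed by a rotation."""
    items.append(value)  # push_back: the new element starts at the end
    # Rotate items[pos:] right by one so the appended element lands at pos.
    items[pos:] = items[-1:] + items[pos:-1]


data = [10, 20, 40]
insert_at(data, 2, 30)
```

The rotation touches every element after `pos`, which is why the commit message notes that a `memmove`-based path would be faster for trivially copyable types.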
Botond Dénes	9258d0e1cf	test/cluster/test_read_repair: write 100 rows in trace test This test asserts that a read repair really happened. To ensure this happens it writes a single partition after enabling the database_apply error injection point. For some reason, the write is sometimes reordered with the error injection and the write will get replicated to both nodes and no read repair will happen, failing the test. To make the test less sensitive to such rare reordering, add a clustering column to the table and write 100 rows. The chance of all 100 of them being reordered with the error injection should be low enough that it doesn't happen again (famous last words). Fixes: #24330 Closes scylladb/scylladb#24403 (cherry picked from commit `495f607e73`) Closes scylladb/scylladb#24972	2025-07-15 20:16:52 +03:00
Raphael S. Carvalho	c307d84925	test: fix flakiness of test_missing_data Only 2025.1 is susceptible. Merge has slightly different logic in master; the test had to be adjusted for 2025.1 but is flaky. It can happen that two successive merges cause the merge wait to never finish. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes scylladb/scylladb#24821 Closes scylladb/scylladb#24936	2025-07-14 14:28:05 +03:00
Marcin Maliszkiewicz	0bbc701863	test: auth_cluster: add test for password reset procedure (cherry picked from commit `aef531077b`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	b91eee6103	test: auth_cluster: add test for replacing default superuser This test demonstrates the guide for creating a custom superuser: https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html (cherry picked from commit `d9223b61a2`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	a10e17106b	test: pylib: add ability to specify default authenticator during server_start Sometimes we may not want to use default cassandra role for control connection, especially when we test dropping default role. (cherry picked from commit `08bf7237f066cead133bf0cac9bba215f238070a`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	b79415a072	test: pylib: allow rolling restart without waiting for cql Waiting for CQL requires default superuser being present in db. In some cases we may delete it and still want to do rolling restart. Additionally if we need CQL we may want to wait after restart is complete (once, and not for each node). (cherry picked from commit `d9ec746c6d`)	2025-07-10 11:19:09 +02:00
Piotr Dulikowski	e59374b721	Merge '[Backport 2025.1] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot] When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason. Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation. backport to relevant versions since the issue can cause node shutdown to hang Fixes scylladb/scylladb#24599 - (cherry picked from commit `8d48b27062`) - (cherry picked from commit `fc5ba4a1ea`) - (cherry picked from commit `7150632cf2`) - (cherry picked from commit `74a3fa9671`) - (cherry picked from commit `a9b476e057`) - (cherry picked from commit `d7af26a437`) Parent PR: #24595 Closes scylladb/scylladb#24878 * github.com:scylladb/scylladb: test: test_batchlog_manager: batchlog replay includes cdc test: test_batchlog_manager: test batch replay when a node is down batchlog_manager: set timeout on writes batchlog_manager: abort writes on shutdown batchlog_manager: create cancellable write response handler storage_proxy: add write type parameter to mutate_internal	2025-07-09 17:23:26 +02:00
Raphael S. Carvalho	f926083fbd	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that an sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second to be filtered out. It's not perfect, but it is extremely unlikely that a new write lands and gets flushed in the same millisecond we recorded the truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, whose creation frequency is low enough for not having significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426 (cherry picked from commit `2d716f3ffe`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24875	2025-07-09 17:39:19 +03:00
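The granularity problem described in `f926083fbd` can be shown with plain integers. This is a sketch under assumptions: the `<=` comparison is inferred from the message ("creation time ... is equal to truncated_at"), and the function name and timestamps are hypothetical.

```python
def selected_for_discard(created_ms, truncated_at_ms, granularity_ms):
    """Compare both timestamps after rounding down to the clock granularity."""
    def tick(t):
        return t // granularity_ms
    return tick(created_ms) <= tick(truncated_at_ms)


truncated_at = 10_000_250  # truncate recorded 250 ms into second T
late_sstable = 10_000_900  # sstable flushed 650 ms later, same second

with_seconds = selected_for_discard(late_sstable, truncated_at, 1000)
with_millis = selected_for_discard(late_sstable, truncated_at, 1)
```

At second granularity the later sstable is indistinguishable from one written before the truncate; at millisecond granularity it is excluded unless both events fall in the same millisecond.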
Michael Litvak	ae7b0838f6	test: test_batchlog_manager: batchlog replay includes cdc Add a new test that verifies that when replaying batch mutations from the batchlog, the mutations include cdc augmentation if needed. This is done in order to verify that it works currently as expected and doesn't break in the future. (cherry picked from commit `d7af26a437`)	2025-07-08 12:32:26 +03:00
Michael Litvak	6d45cb3d5c	test: test_batchlog_manager: test batch replay when a node is down Add a test of the batchlog manager replay loop applying failed batches while some replica is down. The test reproduces an issue where the batchlog manager tries to replay a failed batch, doesn't get a response from some replica, and becomes stuck. It verifies that the batchlog manager can eventually recover from this situation and continue applying failed batches. (cherry picked from commit `a9b476e057`)	2025-07-08 12:32:26 +03:00
Botond Dénes	ffcd772a92	Merge '[Backport 2025.1] sstables/mx/writer: handle non-full prefix row keys' from Scylladb[bot] Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions. 
- (cherry picked from commit `92b5fe8983`) - (cherry picked from commit `0753643606`) - (cherry picked from commit `b0d5462440`) - (cherry picked from commit `093d4f8d69`) - (cherry picked from commit `678deece88`) - (cherry picked from commit `64f8500367`) - (cherry picked from commit `b931145a26`) - (cherry picked from commit `3e1c50e9a7`) - (cherry picked from commit `46ff7f9c12`) - (cherry picked from commit `ebd9420687`) - (cherry picked from commit `aae212a87c`) - (cherry picked from commit `592ca789e2`) - (cherry picked from commit `edc2906892`) Parent PR: #24492 Closes scylladb/scylladb#24740 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-07-03 07:20:07 +03:00
Botond Dénes	872ed2b359	test/boost/sstable_datafile_test: add test for corrupt data * create a table with random schema * generate data: random mutations + one row with bad key * write data to sstable * check that only good data is written to sstable * check that the bad data was saved to system.corrupt_data (cherry picked from commit `edc2906892`)	2025-07-02 14:04:11 +03:00
Botond Dénes	db07141cea	test/lib/cql_assertions: introduce columns_assertions To enable targeted and optionally typed assertions against individual columns in a row. (cherry picked from commit `aae212a87c`)	2025-07-02 14:04:11 +03:00
Botond Dénes	00402cb4c5	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers. (cherry picked from commit `ebd9420687`)	2025-07-02 14:04:09 +03:00
Botond Dénes	1ad5214bdd	Merge '[Backport 2025.1] mutation: check key of inserted rows' from Scylladb[bot] Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. Fixes: https://github.com/scylladb/scylladb/issues/24506 Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters. - (cherry picked from commit `8b756ea837`) - (cherry picked from commit `ab96c703ff`) Parent PR: #24497 Closes scylladb/scylladb#24739 * github.com:scylladb/scylladb: mutation: check key of inserted rows compound: optimize is_full() for single-component types	2025-07-02 12:03:29 +03:00
Lakshmi Narayanan Sreethar	99ec69e27d	utils/big_decimal: fix scale overflow when parsing values with large exponents The exponent of a big decimal string is parsed as an int32, adjusted for the removed fractional part, and stored as an int32. When parsing values like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32 limit, and since the scale is stored as an int32, it overflows and wraps around, losing the value. This patch fixes that by parsing the exponent as an int64 value and then adjusting it for the fractional part. The adjusted scale is then checked to see if it is still within int32 limits before storing. An exception is thrown if it is not within the int32 limits. Note that strings with exponents that exceed the int32 range, like `0.01E2147483650`, were previously not parseable as a big decimal. They are now accepted if the final adjusted scale fits within int32 limits. For the above value, unscaled_value = 1 and scale = -2147483648, so it is now accepted. This is in line with how Java's `BigDecimal` parses strings. Fixes: #24581 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24640 (cherry picked from commit `279253ffd0`) Closes scylladb/scylladb#24691	2025-07-02 11:34:38 +03:00
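The scale arithmetic described in `99ec69e27d` can be checked with a short sketch. It is an illustration only: `parse_big_decimal` is a hypothetical function, Python's unbounded integers stand in for the int64 intermediate, and input validation is omitted.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def parse_big_decimal(text):
    """Return (unscaled_value, scale) for strings like '1.23E-5'."""
    mantissa, _, exponent = text.upper().partition("E")
    integer, _, fraction = mantissa.partition(".")
    unscaled = int(integer + fraction)
    # Adjust the exponent for the removed fractional digits in a wide type.
    scale = len(fraction) - (int(exponent) if exponent else 0)
    # Only the final, adjusted scale has to fit in an int32.
    if not INT32_MIN <= scale <= INT32_MAX:
        raise ValueError(f"scale {scale} does not fit in int32")
    return unscaled, scale
```

With this sketch, `0.01E2147483650` is accepted because its adjusted scale is exactly the int32 minimum, while `1.23E-2147483647` raises because its adjusted scale is 2147483649.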
Botond Dénes	864ba576c5	Merge '[Backport 2025.1] tablets: fix missing data after tablet merge ' from Scylladb[bot] Consider the following scenario: 1) let's assume tablet 0 has range [1, 5] (pre merge) 2) tablet merge happens, tablet 0 has now range [1, 10] 3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5] 4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time 5) replica service is asked to consume range [1, 10] of tablet 0 (post merge) We have two possible outcomes: With cache bypass: 1) cache reader is bypassed 2) sstable reader is created on range [1, 10] 3) unrefreshed tablet_sstable_set holds stale state, but correctly selects all sstables intersecting with range [1, 10] With cache: 1) cache reader is created 2) finds partition with token 5 is cached 3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0) 4) incremental selector consumes the pre-merge sstable spanning range [1, 5] 4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached 4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed. So with the set refreshed, the sstable set is aligned with tablet ranges, and no premature EOS is signalled, which would otherwise prevent the fast forward from happening and all data from being properly captured in the read. This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets. Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution. Fixes: https://github.com/scylladb/scylladb/issues/23313 This change needs to be backported to all supported versions which implement tablet merge. 
- (cherry picked from commit `d0329ca370`) - (cherry picked from commit `1f9f724441`) - (cherry picked from commit `53df911145`) Parent PR: #24287 Closes scylladb/scylladb#24338 * github.com:scylladb/scylladb: replica: Fix range reads spanning sibling tablets test: add reproducer and test for mutation source refresh after merge tablets: trigger mutation source refresh on tablet count change	2025-07-02 11:28:35 +03:00
Botond Dénes	76efa77466	test/boost/memtable_test: only inject error for test table Currently the test indiscriminately injects failures into the flushes of any table, via the IO extension mechanism. The tests want to check that the node correctly handles the IO error by self isolating, however the indiscriminate IO errors can have unintended consequences when they hit raft, leading to disorderly shutdown and failure of the tests. Testing raft's resiliency to IO errors is of course worth doing, but it is not the goal of this particular test, so to avoid the fallout, the IO errors are limited to the test tables only. Fixes: https://github.com/scylladb/scylladb/issues/24637 Closes scylladb/scylladb#24638 (cherry picked from commit `ee6d7c6ad9`) Closes scylladb/scylladb#24741	2025-06-30 20:15:07 +03:00
Raphael S. Carvalho	52ae6c2aba	replica: Fix range reads spanning sibling tablets We don't guarantee that coordinators will only emit range reads that span only one tablet. Consider this scenario: 1) split is about to be finalized, barrier is executed, completes. 2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet) 3) split is committed to group0, all replicas switch storage. 4) replica-side read is executed, uses a range which spans tablets. We could fix it with two-phase split execution. Rather than pushing the complexity to higher levels, let's fix incremental selector which should be able to serve all the tokens owned by a given shard. During split execution, neither of the sibling tablets is going anywhere since it runs with state machine locked, so a single read spanning both sibling tablets works as long as the selector works across tablet boundaries. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `53df911145`)	2025-06-30 10:50:59 -03:00
Ferenc Szili	b0140cb646	test: add reproducer and test for mutation source refresh after merge This change adds a reproducer and test for the fix where the local mutation source is not always refreshed after a tablet merge. (cherry picked from commit `1f9f724441`)	2025-06-30 10:50:53 -03:00
Botond Dénes	d37efb0c08	mutation: check key of inserted rows Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to work with the changes: it has to make the schema use compact storage, otherwise the non-full keys used by this test are rejected by the new checks. Fixes: https://github.com/scylladb/scylladb/issues/24506 (cherry picked from commit `ab96c703ff`)	2025-06-30 12:43:04 +00:00
Abhinav Jha	2b43fb9841	group0: modify `start_operation` logic to account for synchronize phase race condition In the present scenario, the bootstrapping node undergoes the synchronize phase after initialization of group0, then enters the post_raft phase and becomes fully ready for group0 operations. The topology coordinator is agnostic of this and issues the stream ranges command as soon as the node successfully completes `join_group0`. For a node booting into an already upgraded cluster, the time the node remains in the synchronize phase is negligible, but this race condition still causes trouble in a small percentage of cases, since the stream ranges operation fails and the node fails to bootstrap. This commit addresses this issue and updates the error throw logic to account for this edge case: it lets the node wait (with timeouts) for the synchronize phase to get over instead of throwing an error. A regression test is also added to confirm the working of this code change. The test adds a wait in the synchronize phase for the newly joining node and releases it only after the program counter reaches the synchronize case in the `start_operation` function. Hence it indicates that in the updated code, `start_operation` will wait for the node to get done with the synchronize phase instead of throwing an error. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23536 Closes scylladb/scylladb#23829 (cherry picked from commit `5ff693eff6`) Closes scylladb/scylladb#24627	2025-06-29 14:33:01 +03:00
Avi Kivity	837f3eb6c2	Merge '[Backport 2025.1] main: don't start maintenance auth service if not enabled' from Scylladb[bot] In `f96d30c2b5` we introduced the maintenance service, which is an additional instance of auth::service. But this service has a somewhat confusing 2-level startup mechanism: it's initialized with sharded<Service>::start and then auth::service::start (different method with the same name to confuse even more). When maintenance_socket was disabled (default setting), the code did only the first part of the startup. This registered a config observer but didn't create a permission_cache instance. As a result, a crash on SIGHUP when config is reloaded can occur. Fixes: https://github.com/scylladb/scylladb/issues/24528 Backport: all not eol versions since 6.0 and 2025.1 - (cherry picked from commit `97c60b8153`) - (cherry picked from commit `dd01852341`) Parent PR: #24527 Closes scylladb/scylladb#24569 * github.com:scylladb/scylladb: test: add test for live updates of permissions cache config main: don't start maintenance auth service if not enabled	2025-06-29 14:32:35 +03:00
Aleksandra Martyniuk	b7cb4dd413	test: rest_api: fix test_repair_task_progress test_repair_task_progress checks the progress of children of root repair task. However, nothing ensures that the children are already created. Wait until at least one child of a root repair task is created. Fixes: #24556. Closes scylladb/scylladb#24560 (cherry picked from commit `0deb9209a0`) Closes scylladb/scylladb#24654	2025-06-28 09:39:52 +03:00
Marcin Maliszkiewicz	d27150e1ef	test: add test for live updates of permissions cache config (cherry picked from commit `dd01852341`)	2025-06-27 16:06:03 +02:00
Pavel Emelyanov	d431d8a99c	Merge '[Backport 2025.1] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot] `dirty_memory_manager` tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. `dirty_memory_manager` controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should be not greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in `~flush_memory_accounter` which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the `mutation_cleaner` later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in `mutation_cleaner` intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. Fixes scylladb/scylladb#21413 Backport candidate because it fixes a crash that can happen in existing stable branches. - (cherry picked from commit `7d551f99be`) - (cherry picked from commit `975e7e405a`) Parent PR: #21638 Closes scylladb/scylladb#24602 * github.com:scylladb/scylladb: memtable: ensure _flushed_memory doesn't grow above total memory usage replica/memtable: move region_listener handlers from dirty_memory_manager to memtable	2025-06-24 10:03:50 +03:00
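The accounting rule stated at the end of the memtable commit above can be written out as a small Python sketch. It is only one reading of the commit message, not the ScyllaDB code: the class `FlushAccounting` and its field names are hypothetical, and "(if it was positive)" is assumed to mean that only a net decrease of `total_memory` since the start of the flush is subtracted.

```python
from dataclasses import dataclass


@dataclass
class FlushAccounting:
    """Hypothetical model of one memtable's flush accounting."""
    total_at_flush_start: int  # total_memory when the flush began
    total_memory: int          # current total_memory of the LSA region
    reader_estimate: int = 0   # bytes the flush reader has accounted so far

    @property
    def flushed_memory(self) -> int:
        # Assumption: only a net decrease of total_memory is subtracted.
        decrease = max(0, self.total_at_flush_start - self.total_memory)
        return self.reader_estimate - decrease

    @property
    def spooled(self) -> int:
        # The amount reported to the manager never goes below zero.
        return max(0, self.flushed_memory)

    @property
    def unspooled(self) -> int:
        # Memory still counted as dirty and not yet flushed.
        return self.total_memory - self.spooled


acc = FlushAccounting(total_at_flush_start=100, total_memory=100)
acc.reader_estimate = 80  # the reader has seen 80 bytes, some already outdated
acc.total_memory = 60     # 40 bytes were freed asynchronously during the flush
print(acc.flushed_memory, acc.spooled, acc.unspooled)
```

In this example the reader's estimate of 80 would exceed the remaining 60 bytes of total memory; subtracting the 40-byte decrease keeps the flushed amount at 40 and the unspooled amount non-negative.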
Piotr Dulikowski	810ce1c67a	Merge '[Backport 2025.1] test/cluster: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot] In this PR, we're adjusting most of the cluster tests so that they pass with the `rf_rack_valid_keyspaces` configuration option enabled. In most cases, the changes are straightforward and require little to no additional insight into what the tests are doing or verifying. In some, however, doing that does require a deeper understanding of the tests we're modifying. The justification for those changes and their correctness is included in the commit messages corresponding to them. Note that this PR does not cover all of the cluster tests. There are a few remaining ones, but they require a bit more effort, so we delegate that work to a separate PR. I tested all of the modified tests locally with `rf_rack_valid_keyspaces` set to true, and they all passed. Fixes scylladb/scylladb#23959 Backport: we want to backport these changes to 2025.1 since that's the version in which we introduced RF-rack-valid keyspaces. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll also want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint. 
- (cherry picked from commit `dbb8835fdf`) - (cherry picked from commit `9281bff0e3`) - (cherry picked from commit `5b83304b38`) - (cherry picked from commit `73b22d4f6b`) - (cherry picked from commit `2882b7e48a`) - (cherry picked from commit `4c46551c6b`) - (cherry picked from commit `92f7d5bf10`) - (cherry picked from commit `5d1bb8ebc5`) - (cherry picked from commit `d3c0cd6d9d`) - (cherry picked from commit `04567c28a3`) - (cherry picked from commit `c8c28dae92`) - (cherry picked from commit `c4b32c38a3`) - (cherry picked from commit `ee96f8dcfc`) Parent PR: #23661 Closes scylladb/scylladb#24120 * github.com:scylladb/scylladb: test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity test/topology_custom/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair test/topology_custom/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity test/topology_custom/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity test/topology_custom/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity test/topology_custom/test_not_enough_token_owners.py: Adjust to RF-rack-validity test/topology_custom/test_multidc.py: Adjust to RF-rack-validity test/object_store/test_backup.py: Adjust to RF-rack-validity test/topology, test/topology_custom: Adjust simple tests to RF-rack-validity	2025-06-23 09:05:46 +02:00
Michał Chojnowski	b2302d8a46	replica/memtable: move region_listener handlers from dirty_memory_manager to memtable The memtable wants to listen for changes in its `total_memory` in order to decrease its `_flushed_memory` in case some of the freed memory has already been accounted as flushed. (This can happen because the flush reader sees and accounts even outdated MVCC versions, which can be deleted and freed during the flush). Today, the memtable doesn't listen to those changes directly. Instead, some calls which can affect `total_memory` (in particular, the mutation cleaner) manually check the value of `total_memory` before and after they run, and they pass the difference to the memtable. But that's not good enough, because `total_memory` can also change outside of those manually-checked calls -- for example, during LSA compaction, which can occur anytime. This makes memtable's accounting inaccurate and can lead to unexpected states. But we already have an interface for listening to `total_memory` changes actively, and `dirty_memory_manager`, which also needs to know it, does just that. So what happens e.g. when `mutation_cleaner` runs is that `mutation_cleaner` checks the value of `total_memory` before it runs, then it runs, causing several changes to `total_memory` which are picked up by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of `total_memory` and passes the difference to `memtable`, which corrects whatever was observed by `dirty_memory_manager`. To allow memtable to modify its `_flushed_memory` correctly, we need to make `memtable` itself a `region_listener`. Also, instead of the situation where `dirty_memory_manager` receives `total_memory` change notifications from `logalloc` directly, and `memtable` fixes the manager's state later, we want only the memtable to listen for the notifications and pass them, already modified accordingly, to the manager, so that there are no intermediate wrong states. 
This patch moves the `region_listener` callbacks from the `dirty_memory_manager` to the `memtable`. It's not intended to be a functional change, just a source code refactoring. The next patch will be a functional change enabled by this. (cherry picked from commit `7d551f99be`)	2025-06-22 17:37:47 +00:00
Dawid Mędrek	5a41282bb3	test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites Almost all of the tests have been adjusted to be able to be run with the `rf_rack_valid_keyspaces` configuration option enabled, while the rest, a minority, create nodes with it disabled. Thanks to that, we can enable it by default, so let's do that. (cherry picked from commit `ee96f8dcfc`)	2025-06-18 16:47:49 +02:00
Dawid Mędrek	6506eddcfd	test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the test suite have proven to be more problematic in adjusting to RF-rack-validity. Since we'd like to run as many tests as possible with the `rf_rack_valid_keyspaces` configuration option enabled, let's disable it in those. In the following commit, we'll enable it by default. (cherry picked from commit `c4b32c38a3`)	2025-06-18 14:21:53 +02:00
Dawid Mędrek	a63f1737d1	test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity Three tests in the file use a multi-DC cluster. Unfortunately, they put all of the nodes in a DC in the same rack and because of that, they fail when run with the `rf_rack_valid_keyspaces` configuration option enabled. Since the tests revolve mostly around zero-token nodes and how they affect replication in a keyspace, this change should have zero impact on them. (cherry picked from commit `c8c28dae92`)	2025-06-18 14:21:51 +02:00
Dawid Mędrek	720c80239d	test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity We reduce the number of nodes and the RF values used in the test to make sure that the test can be run with the `rf_rack_valid_keyspaces` configuration option. The test doesn't seem to be reliant on the exact number of nodes, so the reduction should not make any difference. (cherry picked from commit `04567c28a3`)	2025-06-18 14:21:48 +02:00

8437 Commits