Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.
Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.
This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.
Refs: #24567Fixes: #25273
(cherry picked from commit 9f4344a435)
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.
Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.
The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch which tests both supported
and unsupported protocol version
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets
Refs: #24567Fixes: #25273
(cherry picked from commit 7aaeed012e)
Currently, nodetool repair command repairs both vnode and tablet keyspaces
if no keyspace is specified. We should use this command to repair
only vnode keyspaces, but this isn't easily accessible - we have to
explicitly run repair only on vnode keyspaces.
nodetool repair skips tablet keyspaces unless a tablet keyspace
is explicitely passed as an argument.
Fixes: #24040.
Closesscylladb/scylladb#24042
(cherry picked from commit 6f8b378e80)
Closesscylladb/scylladb#25152
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.
Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.
However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
stopped.
To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.
Fixesscylladb/scylladb#24857Fixesscylladb/scylladb#24828Closesscylladb/scylladb#24896
(cherry picked from commit fa24fd7cc3)
Closesscylladb/scylladb#24906
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.
This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into vectors, per- owner shard.
Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.
Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.
Fixes#24814
Refs #23284
This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.
* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards
- (cherry picked from commit 3acca0aa63)
- (cherry picked from commit 493a2303da)
- (cherry picked from commit e0a19b981a)
- (cherry picked from commit 2b2cfaba6e)
- (cherry picked from commit 2c0bafb934)
- (cherry picked from commit 4a3d14a031)
- (cherry picked from commit 6e4803a750)
Parent PR: #24618Closesscylladb/scylladb#24862
* github.com:scylladb/scylladb:
token_metadata_impl: clear_gently: release version tracker early
test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables
token_metadata: clear_and_destroy_impl when destroyed
token_metadata: keep a reference to shared_token_metadata
token_metadata: move make_token_metadata_ptr into shared_token_metadata class
replica: database: get and expose a mutable locator::shared_token_metadata
locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
responding before it applied mutations from any of the writes.
We fix the test by not expecting reads with CL=ONE to return a row.
We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.
Fixesscylladb/scylladb#22967
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closesscylladb/scylladb#23518
(cherry picked from commit 21edec1ace)
Fixing conflicts required additionally backporting the log line
from scylladb/scylladb#22968.
Closesscylladb/scylladb#24983
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.
When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.
However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.
If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.
Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
while it's being dropped.
Both cases might as well fail with an error because the column is not
found in the base table.
Fixes https://github.com/scylladb/scylladb/issues/24952
backport needed - simple fix for a node crash
- (cherry picked from commit b336f282ae)
- (cherry picked from commit 86dfa6324f)
Parent PR: #24986Closesscylladb/scylladb#25065
* github.com:scylladb/scylladb:
test: cdc: add test_cdc_with_alter
cdc: throw error if column doesn't exist
When CDC becomes disabled on the base table, the CDC log table
still exsits (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old olg table will be *re-attached*.
Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.
Validation tests are provided.
(cherry picked from commit 59800b1d66)
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.
In this commit, we're restricting those operations. We also provide two
validation tests.
One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.
Fixesscylladb/scylladb#24643
(cherry picked from commit 20d0050f4e)
Reproduces #23284
Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4a3d14a031)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.
This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.
Fixes#13381
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
To be used by a following patch to gently clean and destroy
the token_data_impl in the background.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
The following was seen:
```
!WARNING | scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
(inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
(inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
(inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
(inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```
Fix by using chunked_vector.
Fixes#24158
- (cherry picked from commit c5a136c3b5)
Parent PR: #24561Closesscylladb/scylladb#24890
* github.com:scylladb/scylladb:
storage_service: Use utils::chunked_vector to avoid big allocation
utils: chunked_vector: implement erase() for single elements and ranges
utils: chunked_vector: implement insert() for single-element inserts
This patchset fixes regression introduced by 7e749cd848 when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user.
Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.
If server is started without cluster quorum we'll skip creating role or password.
Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2
- (cherry picked from commit 68fc4c6d61)
- (cherry picked from commit c96c5bfef5)
- (cherry picked from commit 2e2ba84e94)
- (cherry picked from commit f85d73d405)
- (cherry picked from commit d9ec746c6d)
- (cherry picked from commit a3bb679f49)
- (cherry picked from commit 67a4bfc152)
- (cherry picked from commit 0ffddce636)
- (cherry picked from commit 5e7ac34822)
Parent PR: #24451Closesscylladb/scylladb#24693
* github.com:scylladb/scylladb:
test: auth_cluster: add test for password reset procedure
auth: cache roles table scan during startup
test: auth_cluster: add test for replacing default superuser
test: pylib: add ability to specify default authenticator during server_start
test: pylib: allow rolling restart without waiting for cql
auth: split auth-v2 logic for adding default superuser password
auth: split auth-v2 logic for adding default superuser role
auth: ldap: fix waiting for underlying role manager
auth: wait for default role creation before starting authorizer and authenticator
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.
Again we defer optimization for trivially copyable types.
Unit tests are added.
Needed for range_streamer with token_ranges using chunked_vector.
(cherry picked from commit d6eefce145)
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).
Implement using push_back() and std::rotate().
emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).
The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster that std::rotate(),
but this complex optimization is left for later.
Unit tests are added.
(cherry picked from commit 5301f3d0b5)
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write a 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).
Fixes: #24330Closesscylladb/scylladb#24403
(cherry picked from commit 495f607e73)
Closesscylladb/scylladb#24972
2025.1 only is susceptible. Merge has slightly different logic in
master, test had to be adjusted for 2025.1 but is flaky.
Can happen two successive merges cause the merge waiting to never
finish.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixesscylladb/scylladb#24821Closesscylladb/scylladb#24936
Sometimes we may not want to use default cassandra role for
control connection, especially when we test dropping default role.
(cherry picked from commit 08bf7237f066cead133bf0cac9bba215f238070a)
Waiting for CQL requires default superuser being present
in db. In some cases we may delete it and still want to do
rolling restart. Additionally if we need CQL we may want to
wait after restart is complete (once, and not for each node).
(cherry picked from commit d9ec746c6d)
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.
Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.
backport to relevant versions since the issue can cause node shutdown to hang
Fixes scylladb/scylladb#24599
- (cherry picked from commit 8d48b27062)
- (cherry picked from commit fc5ba4a1ea)
- (cherry picked from commit 7150632cf2)
- (cherry picked from commit 74a3fa9671)
- (cherry picked from commit a9b476e057)
- (cherry picked from commit d7af26a437)
Parent PR: #24595Closesscylladb/scylladb#24878
* github.com:scylladb/scylladb:
test: test_batchlog_manager: batchlog replay includes cdc
test: test_batchlog_manager: test batch replay when a node is down
batchlog_manager: set timeout on writes
batchlog_manager: abort writes on shutdown
batchlog_manager: create cancellable write response handler
storage_proxy: add write type parameter to mutate_internal
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.
1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.
So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.
Fixes#23771.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24426
(cherry picked from commit 2d716f3ffe)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24875
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.
This is done in order to verify that it works currently as expected and
doesn't break in the future.
(cherry picked from commit d7af26a437)
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.
The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.
It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.
(cherry picked from commit a9b476e057)
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
- (cherry picked from commit 92b5fe8983)
- (cherry picked from commit 0753643606)
- (cherry picked from commit b0d5462440)
- (cherry picked from commit 093d4f8d69)
- (cherry picked from commit 678deece88)
- (cherry picked from commit 64f8500367)
- (cherry picked from commit b931145a26)
- (cherry picked from commit 3e1c50e9a7)
- (cherry picked from commit 46ff7f9c12)
- (cherry picked from commit ebd9420687)
- (cherry picked from commit aae212a87c)
- (cherry picked from commit 592ca789e2)
- (cherry picked from commit edc2906892)
Parent PR: #24492Closesscylladb/scylladb#24740
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data
(cherry picked from commit edc2906892)
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
(cherry picked from commit ebd9420687)
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.
Fixes: https://github.com/scylladb/scylladb/issues/24506
Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.
- (cherry picked from commit 8b756ea837)
- (cherry picked from commit ab96c703ff)
Parent PR: #24497Closesscylladb/scylladb#24739
* github.com:scylladb/scylladb:
mutation: check key of inserted rows
compound: optimize is_full() for single-component types
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.
This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.
Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.
Fixes: #24581
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#24640
(cherry picked from commit 279253ffd0)
Closesscylladb/scylladb#24691
Consider the following scenario:
1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)
We have two possible outcomes:
With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10]
With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, otherwise preventing fast forward to from happening and all data from being properly captured in the read.
This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.
Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.
Fixes: https://github.com/scylladb/scylladb/issues/23313
This change needs to be backported to all supported versions which implement tablet merge.
- (cherry picked from commit d0329ca370)
- (cherry picked from commit 1f9f724441)
- (cherry picked from commit 53df911145)
Parent PR: #24287Closesscylladb/scylladb#24338
* github.com:scylladb/scylladb:
replica: Fix range reads spanning sibling tablets
test: add reproducer and test for mutation source refresh after merge
tablets: trigger mutation source refresh on tablet count change
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.
Fixes: https://github.com/scylladb/scylladb/issues/24637Closesscylladb/scylladb#24638
(cherry picked from commit ee6d7c6ad9)
Closesscylladb/scylladb#24741
We don't guarantee that coordinators will only emit range reads that
span only one tablet.
Consider this scenario:
1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.
We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, either of sibling tablets aren't going anywhere since it
runs with state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 53df911145)
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.
(cherry picked from commit 1f9f724441)
Make sure the keys are full prefixes as it is expected to be the case
for rows. At severeal occasions we have seen empty row keys make their
ways into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.
The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this tests are rejected by the
new checks.
Fixes: https://github.com/scylladb/scylladb/issues/24506
(cherry picked from commit ab96c703ff)
In the present scenario, the bootstrapping node undergoes synchronize phase after
initialization of group0, then enters post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is agnostic of this and issues stream
ranges command as soon as the node successfully completes `join_group0`. Although for
a node booting into an already upgraded cluster, the time duration for which, node
remains in synchronize phase is negligible but this race condition causes trouble in a
small percentage of cases, since the stream ranges operation fails and node fails to bootstrap.
This commit addresses this issue and updates the error throw logic to account for this
edge case and lets the node wait (with timeouts) for synchronize phase to get over instead of throwing
error.
A regression test is also added to confirm the working of this code change. The test adds a
wait in synchronize phase for newly joining node and releases only after the program counter
reaches the synchronize case in the `start_operation` function. Hence it indicates that in the
updated code, the start_operation will wait for the node to get done with the
synchronize phase instead of throwing error.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#23536Closesscylladb/scylladb#23829
(cherry picked from commit 5ff693eff6)
Closesscylladb/scylladb#24627
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).
When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.
Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1
- (cherry picked from commit 97c60b8153)
- (cherry picked from commit dd01852341)
Parent PR: #24527Closesscylladb/scylladb#24569
* github.com:scylladb/scylladb:
test: add test for live updates of permissions cache config
main: don't start maintenance auth service if not enabled
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.
Wait until at least one child of a root repair task is created.
Fixes: #24556.
Closesscylladb/scylladb#24560
(cherry picked from commit 0deb9209a0)
Closesscylladb/scylladb#24654
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
Fixes scylladb/scylladb#21413
Backport candidate because it fixes a crash that can happen in existing stable branches.
- (cherry picked from commit 7d551f99be)
- (cherry picked from commit 975e7e405a)
Parent PR: #21638Closesscylladb/scylladb#24602
* github.com:scylladb/scylladb:
memtable: ensure _flushed_memory doesn't grow above total memory usage
replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
In this PR, we're adjusting most of the cluster tests so that they pass
with the `rf_rack_valid_keyspaces` configuration option enabled. In most
cases, the changes are straightforward and require little to no additional
insight into what the tests are doing or verifying. In some, however, doing
that does require a deeper understanding of the tests we're modifying.
The justification for those changes and their correctness is included in
the commit messages corresponding to them.
Note that this PR does not cover all of the cluster tests. There are few
remaining ones, but they require a bit more effort, so we delegate that
work to a separate PR.
I tested all of the modified tests locally with `rf_rack_valid_keyspaces`
set to true, and they all passed.
Fixes scylladb/scylladb#23959
Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces in. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll also want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint.
- (cherry picked from commit dbb8835fdf)
- (cherry picked from commit 9281bff0e3)
- (cherry picked from commit 5b83304b38)
- (cherry picked from commit 73b22d4f6b)
- (cherry picked from commit 2882b7e48a)
- (cherry picked from commit 4c46551c6b)
- (cherry picked from commit 92f7d5bf10)
- (cherry picked from commit 5d1bb8ebc5)
- (cherry picked from commit d3c0cd6d9d)
- (cherry picked from commit 04567c28a3)
- (cherry picked from commit c8c28dae92)
- (cherry picked from commit c4b32c38a3)
- (cherry picked from commit ee96f8dcfc)
Parent PR: #23661Closesscylladb/scylladb#24120
* github.com:scylladb/scylladb:
test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites
test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests
test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
test/topology_custom/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
test/topology_custom/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
test/topology_custom/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
test/topology_custom/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
test/topology_custom/test_not_enough_token_owners.py: Adjust to RF-rack-validity
test/topology_custom/test_multidc.py: Adjust to RF-rack-validity
test/object_store/test_backup.py: Adjust to RF-rack-validity
test/topology, test/topology_custom: Adjust simple tests to RF-rack-validity
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).
Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.
But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.
But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.
To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.
This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
(cherry picked from commit 7d551f99be)
Almost all of the tests have been adjusted to be able to be run with
the `rf_rack_valid_keyspaces` configuration option enabled, while
the rest, a minority, create nodes with it disabled. Thanks to that,
we can enable it by default, so let's do that.
(cherry picked from commit ee96f8dcfc)
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
(cherry picked from commit c4b32c38a3)
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.
(cherry picked from commit c8c28dae92)
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.
(cherry picked from commit 04567c28a3)