The test `test_mv_write_to_dead_node` currently uses a timeout of 60
seconds for remove_node, after it was increased from 30 seconds to fix
scylladb/scylladb#22953. Apparently it is still too low, and it was
observed to fail in debug mode.
Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000
seconds, but the test requires a timeout which is shorter than 5
minutes, because it is a regression test for an issue where MV updates
hold topology changes for more than 5 minutes, and we want to verify in
the test that the topology change completes in less than 5 minutes.
To resolve the issue, we set the test to skip in debug mode, because the
remove node operation is unpredictably slow, and we increase the timeout
to 180 seconds which is hopefully enough time for remove_node in
non-debug modes, and still sufficient to satisfy the test requirements.
Fixesscylladb/scylladb#22530Closesscylladb/scylladb#23833
(cherry picked from commit 5c1d24f983)
Closesscylladb/scylladb#23874
* seastar 6d8fccf14c...ed31c1ce82 (1):
> Merge 'Share IO queues between mountpoints' from Pavel Emelyanov
Changes in io-queue call for scylla-gdb update as well -- now the
reactor map of device to io-queue uses seastar::shared_ptr, not
std::unique_ptr.
Closes scylladb/scylladb#23733
Ref 70ac5828a8Fixes#23820
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).
In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.
2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
This is a regression fix, should be backported to all affected releases.
- (cherry picked from commit 4e2f62143b)
- (cherry picked from commit 6c1889f65c)
Parent PR: #23782Closesscylladb/scylladb#23810
* github.com:scylladb/scylladb:
managed_bytes_test: add a reproducer for #23781
managed_bytes: in the copy constructor, respect the target preferred allocation size
Said method passes down its diff input to mutate_internal(), after some std::ranges massaging. Said massaging is destructive -- it moves items from the diff. If the output range is iterated-over multiple times, only the first time will see the actual output, further iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens: mutate_internal() iterates over the range multiple times, first to log its content, then to pass it down the stack. This ends up resulting in an empty range being pased down and write handlers being created with nullopt optionals.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714
A follow-up stability fix for the test is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23513
Fixes: https://github.com/scylladb/scylladb/issues/23512
Based on code-inspection, all versions are vulnerable, although <=6.2 use boost::ranges, not std::ranges.
- (cherry picked from commit 7150442f6a)
Parent PR: #21910Closesscylladb/scylladb#23791
* github.com:scylladb/scylladb:
test/cluster/test_read_repair.py: increase read request timeout
service/storage_proxy: schedule_repair(): materialize the range into a vector
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.
In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.
Fixes: https://github.com/scylladb/scylladb/issues/17174.
Needs backport to 2025.1 to prevent out of space during streaming
- (cherry picked from commit b80e957a40)
- (cherry picked from commit ed7b8bb787)
- (cherry picked from commit 5d6041617b)
- (cherry picked from commit 4a847df55c)
- (cherry picked from commit eb17af6143)
- (cherry picked from commit acd32b24d3)
- (cherry picked from commit 372b562f5e)
Parent PR: #23187Closesscylladb/scylladb#23682
* github.com:scylladb/scylladb:
test: add test for rebuild with repair
locator: service: move to rebuild_v2 transition if cluster is upgraded
locator: service: add transition to rebuild_repair stage for rebuild_v2
locator: service: add rebuild_repair tablet transition stage
locator: add maybe_get_primary_replica
locator: service: add rebuild_v2 tablet transition kind
gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).
In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest
power of 2.
2. With a pathological-enough allocation pattern
(for example, one which somehow ends up placing a single 16 kiB
memtable-owned allocation in every aligned 128 kiB span),
memtable flushing could theoretically deadlock,
because the allocator might be too fragmented to let the memtable
grow by another 128 kiB segment, while keeping the sum of all
allocations small enough to avoid triggering a flush.
(Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious
allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether
reclamation succeeded by checking that the number of free LSA
segments grew.
But in the presence of evictable non-LSA allocations, this is wrong
because the reclaim might have met its target by evicting the non-LSA
allocations, in which case memory is returned directly to the
standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing`
to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
(cherry picked from commit 4e2f62143b)
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1minute to avoid this.
Fixes: #23513Closesscylladb/scylladb#23558
(cherry picked from commit 7bbfa5293f)
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.
Need backporting to 2025.1, as RCU and WCU are not fully supported
Fixes#23690
- (cherry picked from commit 0eabf8b388)
- (cherry picked from commit 88095919d0)
- (cherry picked from commit 3acde5f904)
Parent PR: #23691Closesscylladb/scylladb#23790
* github.com:scylladb/scylladb:
test_returnconsumedcapacity.py: test RCU for batch get item
alternator/executor: Add RCU support for batch get items
alternator/consumed_capacity: make functionality public
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.
Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714Closesscylladb/scylladb#21910
(cherry picked from commit 7150442f6a)
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.
Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.
Fixes#22715
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22922
(cherry picked from commit 29b795709b)
Closesscylladb/scylladb#22965
This patch adds tests for consumed capacity in batch get item. It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.
(cherry picked from commit 3acde5f904)
This patch adds RCU support for batch get items. With batch requests,
multiple objects are read from multiple tables. While the criterion for
adding the units is per the batch request, the units are calculated per
table—and so is the read consistency.
(cherry picked from commit 88095919d0)
The consumed_capacity_counter is not completely applicable for batch
operations. This patch makes some of its functionality public so that
batch get item can use the components to decide if it needs to send
consumed capacity in the reply, to get the half units used by the
metrics and returned result, and to allow an empty constructor for the
RCU counter.
(cherry picked from commit 0eabf8b388)
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.
The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.
While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.
Fixes#23603Closesscylladb/scylladb#23604
(cherry picked from commit b4d4e48381)
Closesscylladb/scylladb#23749
Fixes#23225Fixes#23185
Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves
extension wrapping of file and sink objects to storage level.
(Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not).
This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail.
Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR.
Unit tests for io objects and a macro test for S3 encrypted storage included.
- (cherry picked from commit 98a6d0f79c)
- (cherry picked from commit e100af5280)
- (cherry picked from commit d46dcbb769)
- (cherry picked from commit e02be77af7)
- (cherry picked from commit 9ac9813c62)
- (cherry picked from commit 5c6337b887)
Parent PR: #23261Closesscylladb/scylladb#23424
* github.com:scylladb/scylladb:
encryption: Add "wrap_sink" to encryption sstable extension
encrypted_file_impl: Add encrypted_data_sink
sstables::storage: Move wrapping sstable components to storage provider
sstables::file_io_extension: Add a "wrap_sink" method.
sstables::file_io_extension: Make sstable argument to "wrap" const
utils: Add "io-wrappers", useful IO helper types
Adds a sibling type to encrypted file, a data_sink, that
will write a data stream in the same block format as a file
object would. Including end padding.
For making encrypted data sink writing less cumbersome.
(cherry picked from commit 9ac9813c62)
Fixes#23225Fixes#23185
Moved wrapping component files/sinks to storage provider. Also ensures
to wrap data_sinks as well as actual files. This ensures that we actually
write encryption if active.
(cherry picked from commit e02be77af7)
Similar to wrap file, should wrap a data_sink (used for
sstable writers), in obvious write-only, simple stream
mode.
Default impl will detect if we wrap files for this component,
and if so, generate a file wrapper for the input sink, wrap
this, and the wrap it in a file_data_sink_impl.
This is obviously not efficient, so extensions used in actual
non-test code should implement the method.
(cherry picked from commit d46dcbb769)
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.
(cherry picked from commit e100af5280)
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only,
file based on any output sink.
For testing, and because they fit there, place memory
sink and source types there as well.
(cherry picked from commit 98a6d0f79c)
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.
To workaround the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.
This change:
- Moved syslog_send_helper to audit_syslog_storage_helper
- Corutinize audit_syslog_storage_helper
- Introduce semaphore with count=1 in audit_syslog_storage_helper.
See https://github.com/scylladb/scylla-dtest/pull/5749 for releated dtest
Fixes: scylladb/scylladb#22973
Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1.
- (cherry picked from commit dbd2acd2be)
- (cherry picked from commit 889fd5bc9f)
- (cherry picked from commit c12f976389)
Parent PR: #23464Closesscylladb/scylladb#23674
* github.com:scylladb/scylladb:
audit: add semaphore to audit_syslog_storage_helper
audit: corutinize audit_syslog_storage_helper
audit: moved syslog_send_helper to audit_syslog_storage_helper
If cluster is upgraded to version containing rebuild_v2 transition
kind, move to this transition kind instead of rebuild.
(cherry picked from commit acd32b24d3)
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.
(cherry picked from commit eb17af6143)
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.
rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.
The repair bypasses the tablet repair scheduler and it does not update
the repair_time.
A transition to the rebuild_repair stage will be added in the following
patches.
(cherry picked from commit 4a847df55c)
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.
To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.
The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.
(cherry picked from commit ed7b8bb787)
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.
This PR fixes the issue by updating the load balancer to no schedule any
migrations and to not make any repair plans when there a resize
finalization is pending in any table.
Also added a testcase to verify the fix.
Fixes#21762
- (cherry picked from commit 8cabc66f07)
- (cherry picked from commit 5b47d84399)
- (cherry picked from commit dccce670c1)
Parent PR: #22148Closesscylladb/scylladb#23633
* github.com:scylladb/scylladb:
topology_coordinator: fix indentation in generate_migration_updates
topology_coordinator: do not schedule migrations when there are pending resize finalizations
load_balancer: make repair plans only when there is no pending resize finalization
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.
This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.
This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.
A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.
Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
underlying consumer's on_end_of_stream(), so when consuming multiple
frozen mutations, the underlying's on_end_of_stream() is called for
each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().
Fixes: #20084Closesscylladb/scylladb#23657
(cherry picked from commit d67202972a)
Closesscylladb/scylladb#23694
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.
The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.
The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.
Fixes: https://github.com/scylladb/scylladb/issues/22409.
Needs backport to 2025.1 that introduces the new tablet repair API
- (cherry picked from commit cbde835792)
- (cherry picked from commit b81c81c7f4)
- (cherry picked from commit aa3973c850)
- (cherry picked from commit 8bbc5e8923)
- (cherry picked from commit 02fb71da42)
- (cherry picked from commit 9769d7a564)
Parent PR: #22905Closesscylladb/scylladb#23672
* github.com:scylladb/scylladb:
docs: nodetool: update repair and add tablet-repair docs
test: nodetool: add tests for cluster repair command
nodetool: add cluster repair command
nodetool: repair: extract getting hosts and dcs to functions
nodetool: repair: warn about repairing tablet keyspaces
nodetool: repair: move keyspace_uses_tablets function
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.
Status quo (around the assert in truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.
truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).
So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.
Proposed solution:
The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.
Fixes#18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23560
(cherry picked from commit 0f59deffaa)
(cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23649
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting table
was dropped and skipping to the next candidate, if any.
Fixes#21859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22933
(cherry picked from commit 4d8a333a7f)
Closesscylladb/scylladb#23480
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live release.
- (cherry picked from commit c2518cdf1a)
- (cherry picked from commit 6b5b563ef7)
- (cherry picked from commit 7e600a0747)
- (cherry picked from commit d126ea09ba)
- (cherry picked from commit cb76cafb60)
- (cherry picked from commit df09b3f970)
- (cherry picked from commit e5afd9b5fb)
- (cherry picked from commit 34b18d7ef4)
- (cherry picked from commit f7938e3f8b)
- (cherry picked from commit 6c1f6427b3)
- (cherry picked from commit 0d39091df2)
Parent PR: #23255Closesscylladb/scylladb#23673
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in a read sstables.
Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).
However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it was engaged. This somehow worked so far, but fails with libstdc++
hash table implementation.
Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.
Fixes#23484Closesscylladb/scylladb#23485
(cherry picked from commit 73e4a3c581)
Closesscylladb/scylladb#23504
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.
Fixes#23222
* requires backport on top of https://github.com/scylladb/scylladb/pull/23184
- (cherry picked from commit 0fc196991a)
- (cherry picked from commit f269480f53)
- (cherry picked from commit 41f02c521d)
Parent PR: #23306Closesscylladb/scylladb#23461
* github.com:scylladb/scylladb:
main: allow abort during join_cluster
main: add checkpoint before joining cluster
storage_service: add start_sys_dist_ks
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
Needs backport to all live version, as they all contain the bug
- (cherry picked from commit 876cf32e9d)
- (cherry picked from commit faf3aa13db)
- (cherry picked from commit 44748d624d)
- (cherry picked from commit 35bc1fe276)
Parent PR: #22868Closesscylladb/scylladb#23290
* github.com:scylladb/scylladb:
streaming: fix the way a reason of streaming failure is determined
streaming: save a continuation lambda
streaming: use streaming namespace in table_check.{cc,hh}
repair: streaming: move table_check.{cc,hh} to streaming
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.
To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.
Also added a testcase to verify the fix.
Fixes#21762
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 5b47d84399)
Do not make repair plans if any table has pending resize finalization.
This is to ensure that the finalization doesn't get delayed by reapir
tasks.
Refs #21762
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 8cabc66f07)