When a tablet is migrated and cleaned up, deallocate the tablet storage
group state in the `end_migration` stage instead of the `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
The pending replica is handled similarly when the migration is
aborted: we deallocate the state on `revert_migration`, which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters the `cleanup` stage.
2. The tablet is cleaned up successfully.
3. The leaving replica is restarted and allocates a storage group.
4. Tablet cleanup is not called because the tablet was already cleaned up.
5. The storage group remains allocated on the leaving replica after the
migration is completed; it is not cleaned up properly.
Fixes scylladb/scylladb#23481
(cherry picked from commit 34f15ca871)
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.
Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.
This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.
Fix this by accounting for the metadata.
Also, before the patch, `chunked_managed_vector::max_contiguous_allocation`
repeats the definition of `logalloc::max_managed_object_size`.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.
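The off-by-a-header effect can be sketched as follows. This is only an illustration: `max_object_size`, `chunk_header` and the helper functions are hypothetical stand-ins, not the real `logalloc` or `managed_vector` definitions, and the limit value is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative limit only; it stands in for logalloc::max_managed_object_size
// and is not the real value.
constexpr std::size_t max_object_size = 1024;

// Hypothetical metadata stored inside every chunk allocation, standing in for
// the backreference pointer kept by managed_vector.
struct chunk_header {
    void* backref;
};

// Pre-fix style calculation: ignores the header.
template <typename T>
constexpr std::size_t naive_chunk_capacity() {
    return max_object_size / sizeof(T);
}

// Post-fix style calculation: reserves room for the header first.
template <typename T>
constexpr std::size_t max_chunk_capacity() {
    static_assert(sizeof(chunk_header) < max_object_size, "limit too small");
    return (max_object_size - sizeof(chunk_header)) / sizeof(T);
}

// Total bytes of one chunk holding `capacity` elements, header included.
template <typename T>
constexpr std::size_t chunk_bytes(std::size_t capacity) {
    return sizeof(chunk_header) + capacity * sizeof(T);
}
```

With the naive capacity a full chunk is one header larger than the limit; with the adjusted capacity it fits.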
Fixes scylladb/scylladb#23854
(cherry picked from commit 7f9152babc)
Closes scylladb/scylladb#24369
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.
Make sure that all the try..catch blocks in the topology coordinator
which:
- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that prints an ERROR-level message
...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.
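A minimal sketch of the pattern, assuming a hypothetical `request_aborted` type in place of `raft::request_aborted` and a plain vector in place of ERROR-level logging:

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for raft::request_aborted.
struct request_aborted : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class outcome { ok, failed, aborted };

// Runs one step. Unexpected errors are recorded in error_log (standing in for
// an ERROR-level message); request_aborted passes through without logging.
void guarded_step(const std::function<void()>& step, std::vector<std::string>& error_log) {
    try {
        step();
    } catch (const request_aborted&) {
        throw; // expected during shutdown, do not log
    } catch (...) {
        error_log.push_back("step failed");
    }
}

// Hypothetical caller that treats an abort as a normal way to stop.
outcome run(const std::function<void()>& step, std::vector<std::string>& error_log) {
    const auto errors_before = error_log.size();
    try {
        guarded_step(step, error_log);
    } catch (const request_aborted&) {
        return outcome::aborted;
    }
    return error_log.size() == errors_before ? outcome::ok : outcome::failed;
}
```

The order of the handlers matters: the specific handler must come before `catch (...)`, otherwise the abort would be logged like any other failure.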
Fixes: scylladb/scylladb#22649
Closes scylladb/scylladb#23962
(cherry picked from commit 156ff8798b)
Closes scylladb/scylladb#24074
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead.
Fixes: #16095.
Needs backport to 2025.1 and 6.2 as they contain the bug
- (cherry picked from commit 91b57e79f3)
- (cherry picked from commit c1618c7de5)
Parent PR: #23876
Closes scylladb/scylladb#23904
* github.com:scylladb/scylladb:
test: test table drop during flush
replica: skip flush of dropped table
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.
Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.
Fixes: #15257.
Closes scylladb/scylladb#23915
(cherry picked from commit 20c2d6210e)
Closes scylladb/scylladb#24050
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.
Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.
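The intended callback shape can be sketched with a simplified, single-threaded gate. `simple_gate` and `periodic_reloader` are hypothetical names for illustration and do not reproduce the Seastar gate API or the real `loading_cache`.

```cpp
#include <cassert>

// Hypothetical, single-threaded stand-in for a gate.
class simple_gate {
    bool _closed = false;
    int _entered = 0;
public:
    // Returns false instead of throwing when the gate is already closed.
    bool try_enter() {
        if (_closed) {
            return false;
        }
        ++_entered;
        return true;
    }
    void leave() {
        assert(_entered > 0);
        --_entered;
    }
    void close() { _closed = true; }
};

// Hypothetical model of a cache with a periodic timer.
struct periodic_reloader {
    simple_gate gate;
    int reloads = 0;
    bool armed = true;

    void on_timer() {
        // Enter the gate first; if the cache is stopping, do nothing at all.
        if (!gate.try_enter()) {
            armed = false;
            return;
        }
        ++reloads;    // the timer's actual work
        armed = true; // re-arm while still holding the gate
        gate.leave();
    }

    // Close the gate first, then cancel the timer.
    void stop() {
        gate.close();
        armed = false;
    }
};
```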
Fixes: scylladb/scylladb#23951
Closes scylladb/scylladb#23952
(cherry picked from commit 8ffe4b0308)
Closes scylladb/scylladb#23980
When dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data, which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.
Fix this by adding the missing break.
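The control flow can be sketched as below. Everything here is a simplified, hypothetical model (`parse_tokens`, `handle_node`, `step_result`); it only shows why leaving the case after scheduling the rollback matters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser for a comma-separated token list; std::stoll throws
// std::invalid_argument when an element is not a number.
std::vector<std::int64_t> parse_tokens(const std::string& csv) {
    std::vector<std::int64_t> tokens;
    std::size_t start = 0;
    while (start < csv.size()) {
        std::size_t end = csv.find(',', start);
        if (end == std::string::npos) {
            end = csv.size();
        }
        tokens.push_back(std::stoll(csv.substr(start, end - start)));
        start = end + 1;
    }
    return tokens;
}

enum class node_state { bootstrapping, other };

struct step_result {
    bool rollback_scheduled = false;
    bool generation_prepared = false;
};

step_result handle_node(node_state state, const std::string& initial_token) {
    step_result res;
    switch (state) {
    case node_state::bootstrapping: {
        std::vector<std::int64_t> tokens;
        try {
            tokens = parse_tokens(initial_token);
        } catch (const std::exception&) {
            res.rollback_scheduled = true;
        }
        if (res.rollback_scheduled) {
            break; // the fix: do not continue with empty tokens
        }
        res.generation_prepared = !tokens.empty();
        break;
    }
    case node_state::other:
        break;
    }
    return res;
}
```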
Fixes: scylladb/scylladb#23897
From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.
- (cherry picked from commit 66acaa1bf8)
- (cherry picked from commit 845cedea7f)
- (cherry picked from commit 670a69007e)
Parent PR: #23914
Closes scylladb/scylladb#23948
* github.com:scylladb/scylladb:
test: cluster: add test_bad_initial_token
topology coordinator: do not proceed further on invalid boostrap tokens
cdc: add sanity check for generating an empty generation
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
Fixes: https://github.com/scylladb/scylladb/issues/22514.
Needs backport to 2025.1 and 6.2 as they contain the bug.
- (cherry picked from commit 53e0f79947)
- (cherry picked from commit e178bd7847)
Parent PR: #23787
Closes scylladb/scylladb#23942
* github.com:scylladb/scylladb:
test: add test for getting tasks children
tasks: check whether a node is alive before rpc
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.
For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.
Fixes scylladb/scylladb#21637
Backport: 6.2 and 6.1
- (cherry picked from commit 60f1053087)
- (cherry picked from commit e05c082002)
Parent PR: #22779
Closes scylladb/scylladb#23769
* github.com:scylladb/scylladb:
ensure raft group0 RPCs use the gossip scheduling group
Move RAFT operations verbs to GOSSIP group.
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
(cherry picked from commit 53e0f79947)
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.
(cherry picked from commit e05c082002)
In order for RAFT operations to use the gossip system semaphore, move RAFT
verbs to the gossip group in `do_get_rpc_client_idx` in messaging_service.
Fixes scylladb/scylladb#21637
(cherry picked from commit 60f1053087)
Adds a test which checks that rollback works properly when a bad
value of the initial_token function is provided.
(cherry picked from commit 670a69007e)
When dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data, which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.
Fix this by adding the missing break.
Fixes: scylladb/scylladb#23897
(cherry picked from commit 845cedea7f)
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).
The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially that this crash could happen on the topology coordinator.
(cherry picked from commit 66acaa1bf8)
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.
To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.
In this series we also make sure that the view schemas in the registry are
kept up-to-date with regard to base schema changes.
Fixes https://github.com/scylladb/scylladb/issues/21354
This issue is a bug, so adding backport labels 6.1 and 6.2
- (cherry picked from commit 6f11edbf3f)
- (cherry picked from commit dfe3810f64)
- (cherry picked from commit 82f2e1b44c)
- (cherry picked from commit 3094ff7cbe)
- (cherry picked from commit 74cbc77f50)
Parent PR: #21862
Closes scylladb/scylladb#23046
* github.com:scylladb/scylladb:
test: add test for schema registry maintaining base info for views
schema_registry: avoid setting base info when getting the schema from registry
schema_registry: update cached base schemas when updating a view
schema_registry: cache base schemas for views
db: set base info before adding schema to registry
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.
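The restricted state change can be sketched like this. The enum and function are hypothetical simplifications; in particular, idleness is modeled here as an explicit `idle` value, which is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical model of the topology transition state.
enum class topology_state { idle, tablet_draining, tablet_migration };

// State the topology should be in after a tablet migration is started.
// Only an idle topology is moved to tablet_migration; a busy one is left
// alone, so tablet_draining is never overwritten.
topology_state state_after_starting_migration(topology_state current) {
    if (current == topology_state::idle) {
        return topology_state::tablet_migration;
    }
    return current;
}
```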
Unit test:
Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.
Fixes https://github.com/scylladb/scylladb/issues/20073.
Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.
- (cherry picked from commit e1186f0ae6)
- (cherry picked from commit 841ca652a0)
Parent PR: #23751
Closes scylladb/scylladb#23768
* github.com:scylladb/scylladb:
storage_service: add unit test for mid-decommission transit_tablet()
storage_service: preserve state of busy topology when transiting tablet
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).
In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.
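A rough sketch of the difference between inheriting the source layout and choosing the layout for the target allocator. `allocator_info`, `layout` and both functions are hypothetical, and per-allocation overhead is ignored.

```cpp
#include <cassert>
#include <cstddef>

enum class layout { single_chunk, multi_chunk };

// Hypothetical description of an allocator: the largest contiguous
// allocation it prefers to serve.
struct allocator_info {
    std::size_t preferred_max_contiguous;
};

// Behavior described as buggy: the copy inherits the layout of the source,
// regardless of where the copy will live.
layout copy_layout_inherited(layout source_layout) {
    return source_layout;
}

// Behavior the fix aims for: the layout is chosen from the target
// allocator's preferred size.
layout copy_layout_for_target(std::size_t size, const allocator_info& target) {
    return size <= target.preferred_max_contiguous ? layout::single_chunk
                                                   : layout::multi_chunk;
}
```

For example, with illustrative limits of 128 kiB for the standard allocator and 13 kiB for LSA (values read off the size range above, not taken from the code), a 16 kiB buffer is single-chunk in the former and should become multi-chunk when copied into the latter.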
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.
2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
This is a regression fix, should be backported to all affected releases.
- (cherry picked from commit 4e2f62143b)
- (cherry picked from commit 6c1889f65c)
Parent PR: #23782
Closes scylladb/scylladb#23809
* github.com:scylladb/scylladb:
managed_bytes_test: add a reproducer for #23781
managed_bytes: in the copy constructor, respect the target preferred allocation size
Fixes #22688
If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of the keyspace will fail,
because the union of new/old options will now contain an unknown keyword.
Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy object's dc mapping, but on the full replica topology info
for the dcs to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.
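The options handling can be sketched as follows, assuming a simplified options map from dc name to integer rf (`rf_options`, `apply_alter` and `rf_for` are hypothetical names, not the real ks_prop_defs code):

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified model: dc name -> replication factor.
using rf_options = std::map<std::string, int>;

// Merge the requested options into the current ones, then drop every dc
// whose rf is 0 instead of keeping a dc=0 entry around.
rf_options apply_alter(rf_options current, const rf_options& requested) {
    for (const auto& kv : requested) {
        current[kv.first] = kv.second;
    }
    for (auto it = current.begin(); it != current.end();) {
        if (it->second == 0) {
            it = current.erase(it);
        } else {
            ++it;
        }
    }
    return current;
}

// A dc that is absent from the map is treated as an implicit rf of 0.
int rf_for(const rf_options& options, const std::string& dc) {
    const auto it = options.find(dc);
    return it == options.end() ? 0 : it->second;
}
```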
v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some
Closes scylladb/scylladb#22693
(cherry picked from commit 342df0b1a8)
(Update: workaround python test objects not having dc info)
Closes scylladb/scylladb#22876
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).
In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest
power of 2.
2. With a pathological-enough allocation pattern
(for example, one which somehow ends up placing a single 16 kiB
memtable-owned allocation in every aligned 128 kiB span),
memtable flushing could theoretically deadlock,
because the allocator might be too fragmented to let the memtable
grow by another 128 kiB segment, while keeping the sum of all
allocations small enough to avoid triggering a flush.
(Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious
allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether
reclamation succeeded by checking that the number of free LSA
segments grew.
But in the presence of evictable non-LSA allocations, this is wrong
because the reclaim might have met its target by evicting the non-LSA
allocations, in which case memory is returned directly to the
standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing`
to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
(cherry picked from commit 4e2f62143b)
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1 minute to avoid this.
Fixes: #23513
Fixes: #23512
Closes scylladb/scylladb#23558
(cherry picked from commit 7bbfa5293f)
Closes scylladb/scylladb#23793
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 841ca652a0)
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit e1186f0ae6)
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.
The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.
While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.
Fixes #23603
Closes scylladb/scylladb#23604
(cherry picked from commit b4d4e48381)
Closes scylladb/scylladb#23748
A default timestamp (not to be confused with the timestamp passed via the 'USING TIMESTAMP' query clause) can be set using the 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of the Java CQL Driver.
However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting it for EXECUTE and BATCH traces (I guess I simply forgot to set it back then).
This patch fixes this.
Fixes #23173
The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.
- (cherry picked from commit ca6bddef35)
- (cherry picked from commit f7e1695068)
Parent PR: #23174
Closes scylladb/scylladb#23523
* github.com:scylladb/scylladb:
CQL Tracing: set common query parameters in a single function
transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if the permit timed out). If the permit also happens to be waiting for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.
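A loose model of the added case. `permit_model`, `abort_waiters` and the result enum are hypothetical; the counter stands in for the one-shot promise, which may receive an exception only once.

```cpp
#include <cassert>

// Hypothetical model of the relevant permit state.
struct permit_model {
    bool aborted = false;
    bool waiting_for_memory = false;
    int exceptions_set = 0; // how many times the promise received an exception
};

enum class registration { registered, waiters_aborted, evicted_immediately };

// Models promise<>::set_exception(): calling it twice is a bug.
void abort_waiters(permit_model& p) {
    assert(p.exceptions_set == 0);
    ++p.exceptions_set;
    p.aborted = true;
}

registration register_inactive_read(permit_model& p) {
    if (p.aborted) {
        // New case: the promise already holds an exception, so leave it
        // alone and treat this as immediate eviction.
        return registration::evicted_immediately;
    }
    if (p.waiting_for_memory) {
        abort_waiters(p);
        return registration::waiters_aborted;
    }
    return registration::registered;
}
```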
Fixes: scylladb/scylladb#22919
Bug is present in all live versions, backports are required.
- (cherry picked from commit 4d8eb02b8d)
- (cherry picked from commit 7ba29ec46c)
Parent PR: #23044
Closes scylladb/scylladb#23144
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks whether the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
Needs backport to all live versions, as they all contain the bug
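Why a type-based match misses the wrapped case can be sketched with standard C++ exceptions. `no_such_table` and `nested_failure` are hypothetical stand-ins for the real exception types, and `table_was_dropped` only models the idea of checking the table's existence directly.

```cpp
#include <cassert>
#include <exception>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for data_dictionary::no_such_column_family.
struct no_such_table : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical wrapper carrying two exceptions, in the spirit of a nested
// exception raised while cleaning up after another one.
struct nested_failure : std::exception {
    std::exception_ptr inner;
    std::exception_ptr outer;
    nested_failure(std::exception_ptr i, std::exception_ptr o)
        : inner(std::move(i)), outer(std::move(o)) {}
    const char* what() const noexcept override { return "nested_failure"; }
};

// Type-based check: only matches when the thrown object itself is a
// no_such_table, not when it is wrapped.
bool matches_no_such_table(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const no_such_table&) {
        return true;
    } catch (...) {
        return false;
    }
}

// State-based check: ask whether the table still exists, independent of
// which exception type reached the handler.
bool table_was_dropped(const std::set<std::string>& live_tables, const std::string& name) {
    return live_tables.count(name) == 0;
}
```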
- (cherry picked from commit 876cf32e9d)
- (cherry picked from commit faf3aa13db)
- (cherry picked from commit 44748d624d)
- (cherry picked from commit 35bc1fe276)
Parent PR: #22868
Closes scylladb/scylladb#23289
* github.com:scylladb/scylladb:
streaming: fix the way a reason of streaming failure is determined
streaming: save a continuation lambda
streaming: use streaming namespace in table_check.{cc,hh}
repair: streaming: move table_check.{cc,hh} to streaming
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit in the field, where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than the int32_t maximum.
To avoid such bugs in the future, replace all usage of GetInt() in
nodetool with GetInt64(), just to be sure.
A reproducer for the nodetool netstats crash is added.
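The size boundary involved can be checked without rapidjson. The helper below is a hypothetical illustration of when a value no longer fits in 32 bits, which is the case where a 32-bit accessor is the wrong choice.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if the value can be represented as a 32-bit signed integer.
bool fits_in_int32(std::int64_t value) {
    return value >= std::numeric_limits<std::int32_t>::min()
        && value <= std::numeric_limits<std::int32_t>::max();
}
```

A byte counter for a few GiB of streamed data is already past this boundary, so reading such counters as 64-bit values is the safer default.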
Fixes: scylladb/scylladb#23394
Closes scylladb/scylladb#23395
(cherry picked from commit bd8973a025)
Closes scylladb/scylladb#23475
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks whether the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
(cherry picked from commit 35bc1fe276)
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.
Assign the lambda to a variable to prolong the lifetime of its captures.
(cherry picked from commit 44748d624d)
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if the permit timed out).
If the permit also happens to be waiting for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.
Fixes: scylladb/scylladb#22919
(cherry picked from commit 7ba29ec46c)
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.
(cherry picked from commit 4d8eb02b8d)
Fixes #22314
Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.
Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.
- (cherry picked from commit e6aa09e319)
- (cherry picked from commit 4aaf3df45e)
- (cherry picked from commit 00b40eada3)
- (cherry picked from commit 48fda00f12)
Parent PR: #22327
Closes scylladb/scylladb#23089
* github.com:scylladb/scylladb:
tools: Add standard extensions and propagate to schema load
cql_test_env: Use add all extensions instead of inidividually
main: Move extensions adding to function
tomstone_gc: Make validate work for tools
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which is not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the fix is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live releases.
- (cherry picked from commit c2518cdf1a)
- (cherry picked from commit 6b5b563ef7)
- (cherry picked from commit 7e600a0747)
- (cherry picked from commit d126ea09ba)
- (cherry picked from commit cb76cafb60)
- (cherry picked from commit df09b3f970)
- (cherry picked from commit e5afd9b5fb)
- (cherry picked from commit 34b18d7ef4)
- (cherry picked from commit f7938e3f8b)
- (cherry picked from commit 6c1f6427b3)
- (cherry picked from commit 0d39091df2)
Parent PR: #23255
Closes scylladb/scylladb#23671
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.
This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to a query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.
This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.
A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.
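The skip-and-resume protocol can be sketched with a toy driver. The types below are hypothetical simplifications (integer keys and rows, a consumer that keeps one row per partition) and only model the return-value handling described above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

enum class stop_iteration { no, yes };

struct partition {
    int key;
    std::vector<int> rows;
};

// Toy consumer with a per-partition limit of one row: it asks to skip the
// rest of each partition, but wants to continue with the next one.
struct first_row_consumer {
    int current_key = 0;
    std::vector<std::pair<int, int>> out;

    void consume_new_partition(int key) { current_key = key; }
    stop_iteration consume(int row) {
        out.emplace_back(current_key, row);
        return stop_iteration::yes; // skip the remaining rows of this partition
    }
    stop_iteration consume_end_of_partition() {
        return stop_iteration::no; // resume with the next partition
    }
};

// Driver that respects the protocol: a stop from consume() only ends the
// current partition; only consume_end_of_partition() can end everything.
template <typename Consumer>
void consume_all(const std::vector<partition>& partitions, Consumer& consumer) {
    for (const auto& p : partitions) {
        consumer.consume_new_partition(p.key);
        for (int row : p.rows) {
            if (consumer.consume(row) == stop_iteration::yes) {
                break;
            }
        }
        if (consumer.consume_end_of_partition() == stop_iteration::yes) {
            return;
        }
    }
}
```

A driver that instead carried the stop from `consume()` past the end of the partition would return only the first partition's row, which is the symptom described above.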
Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
underlying consumer's on_end_of_stream(), so when consuming multiple
frozen mutations, the underlying's on_end_of_stream() is called for
each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().
Fixes: #20084
Closes scylladb/scylladb#23657
(cherry picked from commit d67202972a)
Closes scylladb/scylladb#23693
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
(cherry picked from commit 0d39091df2)
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only be active for the table specified in
the "table_name" injection parameter.
(cherry picked from commit 6c1f6427b3)
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceeding.
(cherry picked from commit f7938e3f8b)
Such that a given index in the returned hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
(cherry picked from commit e5afd9b5fb)
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.
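The guard amounts to the following, shown on a hypothetical, heavily simplified cell type (the `<dead>` placeholder is an assumption of this sketch, not the dumper's real output):

```cpp
#include <cassert>
#include <string>

// Hypothetical simplified cell: a dead cell carries no usable value.
struct cell {
    bool live;
    std::string value;
};

// Only read the value when the cell is live.
std::string dump_cell(const cell& c) {
    if (!c.live) {
        return "<dead>";
    }
    return c.value;
}
```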
(cherry picked from commit df09b3f970)