The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.
This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.
Thus, cancel the timer, when calling the drain method to disable
the compaction manager.
Fixes https://github.com/scylladb/scylladb/issues/24504
All versions are affected. So it's a good candidate for a backport.
Closesscylladb/scylladb#24505
(cherry picked from commit a9a53d9178)
Closesscylladb/scylladb#24585
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.
Wait until at least one child of a root repair task is created.
Fixes: #24556.
Closesscylladb/scylladb#24560
(cherry picked from commit 0deb9209a0)
Closesscylladb/scylladb#24652
The purpose of this test that the cluster is able to boot up again after
a full cluster shutdown, thus exhibiting no issues when connecting to
raft group 0 that is larger than one.
(cherry picked from commit 900a6706b8)
This change implements the ability to await superuser creation in the
function ensure_superuser_is_created(). This means that Scylla will not
be serving CQL connections until the superuser is created.
Fixes#10481
(cherry picked from commit 7008b71acc)
This change is a preparation for the next change. Moving to coroutines
makes the code more readable and easier to process.
(cherry picked from commit 04fc82620b)
This change reorganizes the way standard_role_manager startup is
handled: now the future returned by its start() function can be used to
determine when startup has finished. We use this future to ensure the
startup is finished prior to starting the CQL server.
Some clusters are created without auth, and auth is added later. The
first node to recognize that auth is needed must create the superuser.
Currently this is always on restart, but if we were to ever make it
LiveUpdate then it would not be on restart.
This suggests that we don't really need to wait during restart.
This is a preparatory commit, laying ground for implementation of a
start() function that waits for the superuser to be created. The default
implementation returns a ready future, which makes no change in the code
behavior.
(cherry picked from commit f525d4b0c1)
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
Fixes scylladb/scylladb#21413
Backport candidate because it fixes a crash that can happen in existing stable branches.
- (cherry picked from commit 7d551f99be)
- (cherry picked from commit 975e7e405a)
Parent PR: #21638Closesscylladb/scylladb#24601
* github.com:scylladb/scylladb:
memtable: ensure _flushed_memory doesn't grow above total memory usage
replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
(cherry picked from commit 975e7e405a)
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).
Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.
But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.
But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.
To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.
This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
(cherry picked from commit 7d551f99be)
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:
sstable_directory - Moving 1 unshared SSTables to shard 1
Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.
Equip the message with ks:cf pair to make it more informative.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23912
(cherry picked from commit d40d6801b0)
Closesscylladb/scylladb#24014
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).
Fix this by not updating the reference in shareable components, if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this affects uncompressed SSTables,
which are not that common.
Fixes#23728.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#23872
(cherry picked from commit eaa2ce1bb5)
Closesscylladb/scylladb#24267
In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually responses are spread over time and actual merging is atomic.
However sometimes partial results are received at the similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged.
Comparing to the previous implementation order of operands in merging function is swapped, but the order of aggregation is not guaranteed anyway.
Fixes#20662Closesscylladb/scylladb#24106
(cherry picked from commit 5969809607)
Closesscylladb/scylladb#24387
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
Fixes https://github.com/scylladb/scylladb/issues/23481
backport to all relevant releases since it's a bug that results in a crash
- (cherry picked from commit 34f15ca871)
- (cherry picked from commit fb18fc0505)
- (cherry picked from commit bd88ca92c8)
Parent PR: #24393Closesscylladb/scylladb#24486
* github.com:scylladb/scylladb:
test/cluster/test_tablets: test restart during tablet cleanup
test: tablets: add get_tablet_info helper
tablets: deallocate storage state on end_migration
Add a test that reproduces issue scylladb/scylladb#23481.
The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.
This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.
(cherry picked from commit bd88ca92c8)
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
Fixesscylladb/scylladb#23481
(cherry picked from commit 34f15ca871)
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.
Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.
This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.
Fix this by accounting for the metadata.
Also, before the patch `chunked_managed_vector::max_contiguous_allocation`,
repeats the definition of logalloc::max_managed_object_size.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.
Fixesscylladb/scylladb#23854
(cherry picked from commit 7f9152babc)
Closesscylladb/scylladb#24369
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.
Make sure that all the try..catch blocks in the topology coordinator
which:
- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message
...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.
Fixes: scylladb/scylladb#22649Closesscylladb/scylladb#23962
(cherry picked from commit 156ff8798b)
Closesscylladb/scylladb#24074
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead.
Fixes: #16095.
Needs backport to 2025.1 and 6.2 as they contain the bug
- (cherry picked from commit 91b57e79f3)
- (cherry picked from commit c1618c7de5)
Parent PR: #23876Closesscylladb/scylladb#23904
* github.com:scylladb/scylladb:
test: test table drop during flush
replica: skip flush of dropped table
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.
Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.
Fixes: #15257.
Closesscylladb/scylladb#23915
(cherry picked from commit 20c2d6210e)
Closesscylladb/scylladb#24050
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.
Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.
Fixes: scylladb/scylladb#23951Closesscylladb/scylladb#23952
(cherry picked from commit 8ffe4b0308)
Closesscylladb/scylladb#23980
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.
Fix this by adding the missing break.
Fixes: scylladb/scylladb#23897
From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.
- (cherry picked from commit 66acaa1bf8)
- (cherry picked from commit 845cedea7f)
- (cherry picked from commit 670a69007e)
Parent PR: #23914Closesscylladb/scylladb#23948
* github.com:scylladb/scylladb:
test: cluster: add test_bad_initial_token
topology coordinator: do not proceed further on invalid boostrap tokens
cdc: add sanity check for generating an empty generation
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
Fixes: https://github.com/scylladb/scylladb/issues/22514.
Needs backport to 2025.1 and 6.2 as they contain the bug.
- (cherry picked from commit 53e0f79947)
- (cherry picked from commit e178bd7847)
Parent PR: #23787Closesscylladb/scylladb#23942
* github.com:scylladb/scylladb:
test: add test for getting tasks children
tasks: check whether a node is alive before rpc
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.
For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.
Fixes scylladb/scylladb#21637
Backport: 6.2 and 6.1
- (cherry picked from commit 60f1053087)
- (cherry picked from commit e05c082002)
Parent PR: #22779Closesscylladb/scylladb#23769
* github.com:scylladb/scylladb:
ensure raft group0 RPCs use the gossip scheduling group
Move RAFT operations verbs to GOSSIP group.
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
(cherry picked from commit 53e0f79947)
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.
(cherry picked from commit e05c082002)
In order for RAFT operations to use the gossip system semaphore, moving RAFT
verbs to the gossip group in `do_get_rpc_client_idx`, messaging_service.
Fixes scylladb/scylladb21637
(cherry picked from commit 60f1053087)
Adds a test which checks that rollback works properly in case when a bad
value of the initial_token function is provided.
(cherry picked from commit 670a69007e)
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.
Fix this by adding the missing break.
Fixes: scylladb/scylladb#23897
(cherry picked from commit 845cedea7f)
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).
The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially that this crash could happen on the topology coordinator.
(cherry picked from commit 66acaa1bf8)
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.
To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.
In this series we also make sure that the view schemas in the registry are
kept up-to-date in regards to base schema changes.
Fixes https://github.com/scylladb/scylladb/issues/21354
This issue is a bug, so adding backport labels 6.1 and 6.2
- (cherry picked from commit 6f11edbf3f)
- (cherry picked from commit dfe3810f64)
- (cherry picked from commit 82f2e1b44c)
- (cherry picked from commit 3094ff7cbe)
- (cherry picked from commit 74cbc77f50)
Parent PR: #21862Closesscylladb/scylladb#23046
* github.com:scylladb/scylladb:
test: add test for schema registry maintaining base info for views
schema_registry: avoid setting base info when getting the schema from registry
schema_registry: update cached base schemas when updating a view
schema_registry: cache base schemas for views
db: set base info before adding schema to registry
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.
Unit test:
Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.
Fixes https://github.com/scylladb/scylladb/issues/20073.
Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.
- (cherry picked from commit e1186f0ae6)
- (cherry picked from commit 841ca652a0)
Parent PR: #23751Closesscylladb/scylladb#23768
* github.com:scylladb/scylladb:
storage_service: add unit test for mid-decommission transit_tablet()
storage_service: preserve state of busy topology when transiting tablet
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).
In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.
2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
This is a regression fix, should be backported to all affected releases.
- (cherry picked from commit 4e2f62143b)
- (cherry picked from commit 6c1889f65c)
Parent PR: #23782Closesscylladb/scylladb#23809
* github.com:scylladb/scylladb:
managed_bytes_test: add a reproducer for #23781
managed_bytes: in the copy constructor, respect the target preferred allocation size
Fixes#22688
If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.
Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.
v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some
Closesscylladb/scylladb#22693
(cherry picked from commit 342df0b1a8)
(Update: workaround python test objects not having dc info)
Closesscylladb/scylladb#22876
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).
In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest
power of 2.
2. With a pathological-enough allocation pattern
(for example, one which somehow ends up placing a single 16 kiB
memtable-owned allocation in every aligned 128 kiB span),
memtable flushing could theoretically deadlock,
because the allocator might be too fragmented to let the memtable
grow by another 128 kiB segment, while keeping the sum of all
allocations small enough to avoid triggering a flush.
(Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious
allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether
reclamation succeeded by checking that the number of free LSA
segments grew.
But in the presence of evictable non-LSA allocations, this is wrong
because the reclaim might have met its target by evicting the non-LSA
allocations, in which case memory is returned directly to the
standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing`
to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
(cherry picked from commit 4e2f62143b)
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1minute to avoid this.
Fixes: #23513Fixes: #23512Closesscylladb/scylladb#23558
(cherry picked from commit 7bbfa5293f)
Closesscylladb/scylladb#23793
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 841ca652a0)
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit e1186f0ae6)
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.
The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.
While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.
Fixes#23603Closesscylladb/scylladb#23604
(cherry picked from commit b4d4e48381)
Closesscylladb/scylladb#23748
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.
However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then).
This patch fixes this.
Fixes#23173
The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.
- (cherry picked from commit ca6bddef35)
- (cherry picked from commit f7e1695068)
Parent PR: #23174Closesscylladb/scylladb#23523
* github.com:scylladb/scylladb:
CQL Tracing: set common query parameters in a single function
transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.
Fixes: scylladb/scylladb#22919
Bug is present in all live versions, backports are required.
- (cherry picked from commit 4d8eb02b8d)
- (cherry picked from commit 7ba29ec46c)
Parent PR: #23044Closesscylladb/scylladb#23144
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
Needs backport to all live version, as they all contain the bug
- (cherry picked from commit 876cf32e9d)
- (cherry picked from commit faf3aa13db)
- (cherry picked from commit 44748d624d)
- (cherry picked from commit 35bc1fe276)
Parent PR: #22868Closesscylladb/scylladb#23289
* github.com:scylladb/scylladb:
streaming: fix the way a reason of streaming failure is determined
streaming: save a continuation lambda
streaming: use streaming namespace in table_check.{cc,hh}
repair: streaming: move table_check.{cc,hh} to streaming
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit on the field where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.
To avoid such bugs in the future, replace all usage of GetInt() in
nodetool of GetInt64(), just to be sure.
A reproducer is added to the nodetool netstats crash.
Fixes: scylladb/scylladb#23394Closesscylladb/scylladb#23395
(cherry picked from commit bd8973a025)
Closesscylladb/scylladb#23475
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
(cherry picked from commit 35bc1fe276)
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.
Assign a lambda to a variable to prolong the captures lifetime.
(cherry picked from commit 44748d624d)
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if permit timed out).
If the permit also happens to have wait for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.
Fixes: scylladb/scylladb#22919
(cherry picked from commit 7ba29ec46c)
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.
(cherry picked from commit 4d8eb02b8d)
Fixes#22314
Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.
Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.
- (cherry picked from commit e6aa09e319)
- (cherry picked from commit 4aaf3df45e)
- (cherry picked from commit 00b40eada3)
- (cherry picked from commit 48fda00f12)
Parent PR: #22327Closesscylladb/scylladb#23089
* github.com:scylladb/scylladb:
tools: Add standard extensions and propagate to schema load
cql_test_env: Use add all extensions instead of inidividually
main: Move extensions adding to function
tomstone_gc: Make validate work for tools
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live release.
- (cherry picked from commit c2518cdf1a)
- (cherry picked from commit 6b5b563ef7)
- (cherry picked from commit 7e600a0747)
- (cherry picked from commit d126ea09ba)
- (cherry picked from commit cb76cafb60)
- (cherry picked from commit df09b3f970)
- (cherry picked from commit e5afd9b5fb)
- (cherry picked from commit 34b18d7ef4)
- (cherry picked from commit f7938e3f8b)
- (cherry picked from commit 6c1f6427b3)
- (cherry picked from commit 0d39091df2)
Parent PR: #23255Closesscylladb/scylladb#23671
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.
This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.
This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.
A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.
Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
underlying consumer's on_end_of_stream(), so when consuming multiple
frozen mutations, the underlying's on_end_of_stream() is called for
each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().
Fixes: #20084Closesscylladb/scylladb#23657
(cherry picked from commit d67202972a)
Closesscylladb/scylladb#23693
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
(cherry picked from commit 0d39091df2)
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only active for the table specified in
the "table_name" injection parameter.
(cherry picked from commit 6c1f6427b3)
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceedin.
(cherry picked from commit f7938e3f8b)
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
(cherry picked from commit e5afd9b5fb)
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.
(cherry picked from commit df09b3f970)
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.
(cherry picked from commit cb76cafb60)
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable, which still have active read against it.
To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.
(cherry picked from commit d126ea09ba)
The cache should not garbage-collect tombstone which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.
(cherry picked from commit 6b5b563ef7)
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.
This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, this makes the key reference passed to the compaction
corrupt (refer to moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstone being garbage-collected while they
still cover data elsewhere (data resurrection).
The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.
(cherry picked from commit c2518cdf1a)
`safe_foreach_sstable` doesn't do its job correctly.
It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.
The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.
Remove this function and fix its usages to obtain the set and iterate
over it under the lock.
Closesscylladb/scylladb#23397
(cherry picked from commit e23fdc0799)
Closesscylladb/scylladb#23627
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.
Fixes#23348
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#23579
(cherry picked from commit 750f4baf44)
Closesscylladb/scylladb#23644
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error. This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
Error: Input required and not supplied: token
```
This commit addresses the issue by:
- Running the workflow in the base repository instead of the fork. This
grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
controlexecution, as recommended in the GitHub Actions documentation
(https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.
This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.
Fixesscylladb/scylladb#22765
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22766
(cherry picked from commit ca832dc4fb)
Closesscylladb/scylladb#23617
before this change, we specify the KillMode of the scylla-service
service unit explicitly to "process". according to
according to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,
> If set to process, only the main process itself is killed (not recommended!).
and the document suggests use "control-group" over "process".
but scylla server is not a multi-process server, it is a multi-threaded
server. so it should not make any difference even if we switch to
the recommended "control-group".
in the light that we've been seeing "defunct" scylla process after
stopping the scylla service using systemd. we are wondering if we should
try to change the `KillMode` to "control-group", which is the default
value of this setting.
in this change, we just drop the setting so that the systemd stops the
service by stopping all processes in the control group of this unit
are stopped.
Fixesscylladb/scylladb#21507
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 961a53f716)
Closesscylladb/scylladb#23176
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.
Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.
This also allows an easy addition of such parameters to this function in
the future.
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.
However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to
set it back then).
This patch fixes this.
Fixes#23173
(cherry picked from commit ca6bddef35)
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed
Closesscylladb/scylladb#22446
(cherry picked from commit ed4bfad5c3)
Closesscylladb/scylladb#23006
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.
This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.
Fixes#23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23155
(cherry picked from commit 2a9966a20e)
Closesscylladb/scylladb#23198
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
last_modified timestamps
This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.
Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```
Related to previous fix 39325cf which handled negative read_timestamp cases.
Fixes#23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23359
(cherry picked from commit ebf9125728)
Closesscylladb/scylladb#23386
The test fails sporadically with:
cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
That's becase a server is stopped in the middle of the workload.
The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.
Fixes#20492Closesscylladb/scylladb#23451
(cherry picked from commit 8e506c5a8f)
Closesscylladb/scylladb#23468
This commit adds documentation for zero-token nodes and an explanation
of how to use them to set up an arbiter DC to prevent a quorum loss
in multi-DC deployments.
The commit adds two documents:
- The one in Architecture describes zero-token nodes.
- The other in Cluster Management explains how to use them.
We need separate documents because zero-token nodes may be used
for other purposes in the future.
In addition, the documents are cross-linked, and the link is added
to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document.
Refs https://github.com/scylladb/scylladb/pull/19684
Fixes https://github.com/scylladb/scylladb/issues/20294Closesscylladb/scylladb#21348
(cherry picked from commit 9ac0aa7bba)
Closesscylladb/scylladb#23200
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.
Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.
Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.
Fixes: scylladb/scylladb#22953Closesscylladb/scylladb#22958
(cherry picked from commit 43ae3ab703)
Closesscylladb/scylladb#23052
In this patch we test the behavior of schema registry in a few
scenarios where it was identified it could misbehave.
The first one is reverse schemas for views. Previously, SELECT
queries with reverse order on views could fail because we didn't
have base info in the registry for such schemas.
The second one is schemas that temporarily died in the registry.
This can happen when, while processing a query for a given schema
version, all related schema_ptrs were destroyed, but this schema
was requested before schema_registry::grace_period() has passed.
In this scenario, the base info would not be recovered, causing
errors.
(cherry picked from commit 74cbc77f50)
After the previous patches, the view schemas returned by schema registry
always have their base info set. As such, we no longer need to set it after
getting the view schema from the registry. This patch removes these
unnecessary updates.
(cherry picked from commit 3094ff7cbe)
Fixes#22314
Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.
(cherry picked from commit 48fda00f12)
The schema registry now holds base schemas for view schemas.
The base schema may change without changing the view schema, so to
preserve the change in the schema registry, we also update the
base schema in the registry when updating the base info in the
view schema.
(cherry picked from commit 82f2e1b44c)
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated.
To fix this, this patch adds the base schema to the registry, alongside
the view schema. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.
To store the base schema, the loader methods now have to return the base
schema alongside the view schema. At the same time, when loading into
the registry, we need to check whether we're loading a view schema, and if
so, we need to also provide the base schema. When inserting a regular table
schema, the base schema should be a disengaged optional.
(cherry picked from commit dfe3810f64)
In the following patches, we'll assure that view schemas returned by the
schema registry always have base info set. To prepare for that, make sure
that the base info is always set before inserting it into schema registry,
(cherry picked from commit 6f11edbf3f)
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.
Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.
Fixes#22715
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#22922
(cherry picked from commit 29b795709b)
Closesscylladb/scylladb#22964
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature eviction.
Fixes: https://github.com/scylladb/scylladb/issues/22629
Backport required to all active releases, they are all affected.
- (cherry picked from commit a3ae0c7cee)
- (cherry picked from commit 9174f27cc8)
Parent PR: #22701Closesscylladb/scylladb#22751
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: set_notify_handler(): disable timeout
reader_permit: mark check_abort() as const
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. This latter was not disabling the
pre-existing timeout of the permit (if any) and this would lead to
premature eviction of the cache entry if the timeout was shorter than
TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature
eviction.
Fixes: #scylladb/scylladb#22629
(cherry picked from commit 9174f27cc8)
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.
Fixes: #22770Closesscylladb/scylladb#22771
(cherry picked from commit 87e8e00de6)
Closesscylladb/scylladb#22873
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.
Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714Closesscylladb/scylladb#21910
(cherry picked from commit 7150442f6a)
Closesscylladb/scylladb#22853
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.
Fixes: #22620Closesscylladb/scylladb#22622
(cherry picked from commit 7ce932ce01)
Closesscylladb/scylladb#22718
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.
Fixes: #22588Closesscylladb/scylladb#22624
(cherry picked from commit f2d5819645)
Closesscylladb/scylladb#22703
When a replica get a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.
Mark schema as synced after the table is added to database::_tables_metadata.
Fixes: #22347.
Closesscylladb/scylladb#22348
(cherry picked from commit 328818a50f)
Closesscylladb/scylladb#22603
If start_time/end_time is unspecified for a task, task_manager API
returns epoch. Nodetool prints the value in task status.
Fix nodetool tasks commands to print empty string for start_time/end_time
if it isn't specified.
Modify nodetool tasks status docs to show empty end_time.
Fixes: #22373.
Closesscylladb/scylladb#22370
(cherry picked from commit 477ad98b72)
Closesscylladb/scylladb#22600
`tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups.
The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that
`tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but
`tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE
Fixes#22431
This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1
- (cherry picked from commit 24e8d2a55c)
- (cherry picked from commit 8bff7786a8)
Parent PR: #22330Closesscylladb/scylladb#22559
* github.com:scylladb/scylladb:
test: add reproducer and test for fix to split ready CG creation
table: run set_split_mode() on all storage groups during all_storage_groups_split()
Currently, when the status of a task is queried and the task is already finished,
it gets unregistered. Getting the status shouldn't be a one-time operation.
Stop removing the task after its status is queried. Adjust tests not to rely
on this behavior. Add task_manager/drain API and nodetool tasks drain
command to remove finished tasks in the module.
Fixes: https://github.com/scylladb/scylladb/issues/21388.
It's a fix to task_manager API, should be backported to all branches
- (cherry picked from commit e37d1bcb98)
- (cherry picked from commit 18cc79176a)
Parent PR: #22310Closesscylladb/scylladb#22597
* github.com:scylladb/scylladb:
api: task_manager: do not unregister tasks on get_status
api: task_manager: add /task_manager/drain
This series exposes a Clock template parameter for loading_cache so that the test could use
the manual_clock rather than the lowres_clock, since relying on the latter is flaky.
In addition, the test load function is simplified to sleep some small random time and co_return the expected string,
rather than reading it from a real file, since the latter's timing might also be flaky, and it out-of-scope for this test.
Fixes#20322
* The test was flaky forever, so backport is required for all live versions.
- (cherry picked from commit b509644972)
- (cherry picked from commit 934a9d3fd6)
- (cherry picked from commit d68829243f)
- (cherry picked from commit b258f8cc69)
- (cherry picked from commit 0841483d68)
- (cherry picked from commit 32b7cab917)
Parent PR: #22064Closesscylladb/scylladb#22640
* github.com:scylladb/scylladb:
tests: loading_cache_test: use manual_clock
utils: loading_cache: make clock_type a template parameter
test: loading_cache_test: use function-scope loader
test: loading_cache_test: simlute loader using sleep
test: lib: eventually: add sleep function param
test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
Relying on a real-time clock like lowres_clock
can be flaky (in particular in debug mode).
Use manual_clock instead to harden the test against
timing issues.
Fixes#20322
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 32b7cab917)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So the unit test can use manual_clock rather than lowres_clock
which can be flaky (in particular in debug mode).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0841483d68)
Rather than a global function, accessing a thread-local `load_count`.
The thread-local load_count cannot be used when multiple test
cases run in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b258f8cc69)
This test isn't about reading values from file,
but rather it's about the loading_cache.
Reading from the file can sometimes take longer than
the expected refresh times, causing flakiness (see #20322).
Rather than reading a string from a real file, just
sleep a random, short time, and co_return the string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d68829243f)
rather then macros.
This is a first cleanup step before adding a sleep function
parameter to support also manual_clock.
Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL,
respectively, to make an error more visible in the test log
since those entry points print the offending values
when not equal.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b509644972)
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.
A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.
The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.
To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
Fixesscylladb/scylladb#21829Closesscylladb/scylladb#22493
(cherry picked from commit 6d34125eb7)
Closesscylladb/scylladb#22606
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister queries task if it
has already finished.
The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.
(cherry picked from commit 18cc79176a)
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.
Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.
(cherry picked from commit e37d1bcb98)
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.
Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.
Fixes: #21964.
Closesscylladb/scylladb#22145
(cherry picked from commit 32ab58cdea)
Closesscylladb/scylladb#22318
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.
While at it, the type of these fields is also fixed in the
documentation.
Before:
"min_column_names": [
"��Z���\u0011�\u0012ŷ4^��<",
"�2y\u0000�}\u007f"
],
"max_column_names": [
"��Z���\u0011�\u0012ŷ4^��<",
"}��B\u0019l%^"
],
After:
"min_column_names": [
"9dd55a92bc8811ef12c5b7345eadf73c",
"80327900e2827d7f"
],
"max_column_names": [
"9dd55a92bc8811ef12c5b7345eadf73c",
"7df79242196c255e"
],
Fixes: #22078Closesscylladb/scylladb#22225
(cherry picked from commit f899f0e411)
Closesscylladb/scylladb#22296
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.
This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.
Fixes#21015
This fixes a regression and should be backported.
- (cherry picked from commit d2ba45a01f)
- (cherry picked from commit 6e3ecc70a6)
- (cherry picked from commit 63100b34da)
Parent PR: #22263Closesscylladb/scylladb#22376
* github.com:scylladb/scylladb:
sstable_directory: do not load remote sstables in process_descriptor
sstable_directory: reintroduce `get_shards_for_this_sstable()`
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.
Fixes: #21480Closesscylladb/scylladb#22251
(cherry picked from commit 55963f8f79)
Closesscylladb/scylladb#22379
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.
Skip the keyspace repair if no_such_keyspace is thrown during preparations.
Fixes: #22073.
Requires backport to 6.1 and 6.2 as they contain the bug
- (cherry picked from commit bfb1704afa)
- (cherry picked from commit 54e7f2819c)
Parent PR: #22473Closesscylladb/scylladb#22541
* github.com:scylladb/scylladb:
test: add test to check if repair handles no_such_keyspace
repair: handle keyspace dropped
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.
Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.
What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.
To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.
Fixesscylladb/scylladb#21227Closesscylladb/scylladb#22093
(cherry picked from commit 4f5550d7f2)
Closesscylladb/scylladb#22545
This adds a reproducer for #22431
In cases where a tablet storage group manager had more than one storage
group, it was possible to create compaction groups outside the group0
guard, which could create problems with operations which should exclude
with compaction group creation.
(cherry picked from commit 8bff7786a8)
tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode()
for each of its storage groups to create split ready compaction groups. It does
this by iterating through storage groups using std::ranges::all_of() which is
not guaranteed to iterate through the entire range, and will stop iterating on
the first occurance of the predicate (set_split_mode()) returning false.
set_split_mode() creates the split compaction groups and returns false if the
storage group's main compaction group or merging groups are not empty. This
means that in cases where the tablet storage group manager has non-empty
storage groups, we could have a situation where split compaction groups are not
created for all storage groups.
The missing split compaction groups are later created in
tablet_storage_group_manager::split_all_storage_groups() which also calls
set_split_mode(), and that is the reason why split completes successfully. The
problem is that tablet_storage_group_manager::all_storage_groups_split() runs
under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups()
does not. This can cause problems with operations which should exclude with
compaction group creation. i.e. DROP TABLE/DROP KEYSPACE
(cherry picked from commit 24e8d2a55c)
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.
Skip the keyspace repair if no_such_keyspace is thrown during preparations.
(cherry picked from commit bfb1704afa)
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.
Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.
Fixesscylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22451
(cherry picked from commit 4a268362b9)
Closesscylladb/scylladb#22510
Fixes#20633
Cannot assert on actual request_controller when releasing permit, as the
release, if we have waiters in queue, will subtract some units to hand to them.
Instead assert on permit size + waiter status (and if zero, also controller value)
* v2 - use SCYLLA_ASSERT
(cherry picked from commit 58087ef427)
Closesscylladb/scylladb#22455
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.
Fixes scylladb/scylladb#20814
Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.
- (cherry picked from commit 775411ac56)
- (cherry picked from commit 16053a86f0)
- (cherry picked from commit 8c48f7ad62)
- (cherry picked from commit 3da4848810)
- (cherry picked from commit 228a66d030)
Parent PR: #22253Closesscylladb/scylladb#22358
* github.com:scylladb/scylladb:
raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
raft: Handle non-critical config update errors in when changing status to voter.
test: Add test to check that a node does not fail on unknown commit status error when starting up.
raft: Add run_op_with_retry in raft_group0.
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.
(cherry picked from commit 228a66d030)
for retrying.
There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.
(cherry picked from commit 3da4848810)
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.
This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.
Fixes#21015
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 63100b34da)
Reintroduce `get_shards_for_this_sstable()` that was removed in commit
ad375fbb. This will be used in the following patch to ensure that an
sstable is loaded only once.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d2ba45a01f)
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixes#20638
The PR contains few additional small fixes in separate commits related to the view build status table.
It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it timeouts.
For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.
backport to fix the bugs that affects previous versions and improve CI stability
- (cherry picked from commit b1be2d3c41)
- (cherry picked from commit 1104411f83)
- (cherry picked from commit 7a6aec1a6c)
Parent PR: #22307Closesscylladb/scylladb#22356
* github.com:scylladb/scylladb:
view_builder: hold semaphore during entire startup
view_builder: pass view name by value to write_view_build_status
view_builder: write status to tables before starting to build
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.
(cherry picked from commit 7a6aec1a6c)
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.
The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.
(cherry picked from commit 1104411f83)
In this PR, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:
(a) The following things happened in order:
1. The view builder would be constructed.
2. Right after that, a deferred lambda would be created to stop the
view builder during shutdown.
3. group0_service would be started.
4. A deferred lambda stopping group0_service would be created right
after that.
5. The view builder would be started.
(b) Because the view builder depends on group0_client, it couldn't be
started before starting group0_service. On the other hand, other
services depend on the view builder, e.g. the stream manager. That
makes changing the order of initialization a difficult problem,
so we want to avoid doing that unless we're sure it's the right
choice.
(c) Since the view builder uses group0_client, there was a possibility
of running into a segmentation fault issue in the following
scenario:
1. A call to `view_builder::mark_view_build_success()` is issued.
2. We stop group0_service.
3. `view_builder::mark_view_build_success()` calls
`announce_with_raft()`, which leads to a use-after-free because
group0_service has already been destroyed.
This very scenario took place in scylladb/scylladb#20772.
Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534,
so we revert that patch. These changes are the second attempt
to the problem where we want to solve it in a safer manner.
The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.
Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.
Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.
Fixesscylladb/scylladb#20772Fixesscylladb/scylladb#21534
- (cherry picked from commit a5715086a4)
- (cherry picked from commit 06ce976370)
- (cherry picked from commit d1f960eee2)
Parent PR: #21909Closesscylladb/scylladb#22331
* github.com:scylladb/scylladb:
test/topology_custom: Add test for Scylla with disabled view building
main, view: Pair view builder drain with its start
Revert "main,cql_test_env: start group0_service before view_builder"
to voter.
When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.
Fixesscylladb/scylladb#20814
(cherry picked from commit 8c48f7ad62)
error when starting up.
Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.
Test for scylladb/scylladb#20814
(cherry picked from commit 16053a86f0)
Since when calling `modify_config` it's quite often we need to do
retries, to avoid code duplication, a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.
(cherry picked from commit 775411ac56)
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixesscylladb/scylladb#20638
(cherry picked from commit b1be2d3c41)
This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values.
Fixes scylladb/scylladb#19101
Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2
- (cherry picked from commit e577f1d141)
- (cherry picked from commit 39785c6f4e)
- (cherry picked from commit 155480595f)
Parent PR: #21996Closesscylladb/scylladb#22298
* github.com:scylladb/scylladb:
storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
test: Add test case for checking read repair diff calculation when having conflicting keys.
Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.
(cherry picked from commit d1f960eee2)
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:
(a) The following things happened in order:
1. The view builder would be constructed.
2. Right after that, a deferred lambda would be created to stop the
view builder during shutdown.
3. group0_service would be started.
4. A deferred lambda stopping group0_service would be created right
after that.
5. The view builder would be started.
(b) Because the view builder depends on group0_client, it couldn't be
started before starting group0_service. On the other hand, other
services depend on the view builder, e.g. the stream manager. That
makes changing the order of initialization a difficult problem,
so we want to avoid doing that unless we're sure it's the right
choice.
(c) Since the view builder uses group0_client, there was a possibility
of running into a segmentation fault issue in the following
scenario:
1. A call to `view_builder::mark_view_build_success()` is issued.
2. We stop group0_service.
3. `view_builder::mark_view_build_success()` calls
`announce_with_raft()`, which leads to a use-after-free because
group0_service has already been destroyed.
This very scenario took place in scylladb/scylladb#20772.
Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are the
second attempt to the problem where we want to solve it in a safer manner.
The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.
Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.
Fixesscylladb/scylladb#20772
(cherry picked from commit 06ce976370)
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:
```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```
we started getting the following failure:
```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```
We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.
This reverts commit 7bad8378c7.
(cherry picked from commit a5715086a4)
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.
Fixesscylladb/scylladb#21807
Backport: not needed since this is a fix in the testing scripts.
Closesscylladb/scylladb#22192
(cherry picked from commit 2f1731c551)
Closesscylladb/scylladb#22249
function.
The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.
(cherry picked from commit 155480595f)
diff calculation hashmap.
This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.
Fixesscylladb/scylladb#19101
(cherry picked from commit 39785c6f4e)
conflicting keys.
The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.
(cherry picked from commit e577f1d141)
This series attempts to get read of flakiness in cache_algorithm_test by solving two problems.
Problem 1:
The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.
Problem 2:
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
Fixesscylladb/scylladb#21536
(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)
Parent PR: scylladb/scylladb#21948Closesscylladb/scylladb#22228
* github.com:scylladb/scylladb:
cache_algorithm_test: harden against stats being confused by background activity
cache_algorithm_test: fix a use of an uninitialized variable
Currently task_manager_module::is_aborted checks the tasks local
to caller's shard on a given shard.
Fix the method to check the task map local to the given shard.
Fixes: #22156.
Closesscylladb/scylladb#22161
(cherry picked from commit a91e03710a)
Closesscylladb/scylladb#22197
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea if this PR is with conflicts or not (in Scylla it's important since CI will not start on draft PR)
Let's add a comment to notify the user we have conflicts
Closesscylladb/scylladb#21939
(cherry picked from commit 2e6755ecca)
Closesscylladb/scylladb#22190
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.
Added a testcase to verify the fix.
Fixes#21887
This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.
Closesscylladb/scylladb#21895
* github.com:scylladb/scylladb:
sstables_manager: reclaim memory from sstables on unlink
sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
sstables: introduce disable_component_memory_reload()
sstables_manager: log sstable name when reclaiming components
(cherry picked from commit d4129ddaa6)
Closesscylladb/scylladb#21998
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
(cherry picked from commit 6caaead4ac)
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
(cherry picked from commit 1fffd976a4)
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.
Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.
As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.
We want to backport this change to all supported branches so that
it helps us in all tests.
Fixesscylladb/scylladb#22160Closesscylladb/scylladb#22159
The series contains small fixes to the gossiper one of which fixes#21930. Others I noticed while debugged the issue.
Fixes: #21930
(cherry picked from commit 91cddcc17f)
Parent PR: #21956Closesscylladb/scylladb#21991
* github.com:scylladb/scylladb:
gossiper: do not reset _just_removed_endpoints in non raft mode
gossiper: do not call apply for the node's old state
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.
This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.
A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.
Fixes: scylladb/scylladb#20082Closesscylladb/scylladb#21600
(cherry picked from commit 6c90a25014)
Closesscylladb/scylladb#21822
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.
Fixesscylladb/scylladb#20754
The PR should be backported to 6.2, this version has view builder on group0.
Closesscylladb/scylladb#21708
* github.com:scylladb/scylladb:
test/topology_custom/test_view_build_status: add reproducer
service/topology_coordinator: migrate view builder only if all nodes are up
(cherry picked from commit def51e252d)
Closesscylladb/scylladb#21850
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.
It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.
Fixesscylladb/scylladb#21801Closesscylladb/scylladb#21761
* github.com:scylladb/scylladb:
Revert "generic_server: use async function in `for_each_gently()`"
transport/server: use synchronous calls in `for_each_gently` callback
service/client_state: add synchronous method to update service level params
qos/service_level_controller: add `find_cached_effective_service_level`
(cherry picked from commit c601f7a359)
Closesscylladb/scylladb#21849
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).
Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.
Fixes#21325
(cherry picked from commit 868f5b59c4)
(cherry picked from commit 0f2101b055)
Refs #21323Closesscylladb/scylladb#21358
* github.com:scylladb/scylladb:
utils: cached_file: Mark permit as awaiting on page miss
utils: cached_file: Push resource_unit management down to cached_file
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.
The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.
Fixesscylladb/scylladb#21763Closesscylladb/scylladb#21773
(cherry picked from commit 373855b493)
Closesscylladb/scylladb#21893
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.
The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.
The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.
Fixesscylladb/scylladb#20483Closesscylladb/scylladb#21748
(cherry picked from commit 53224d90be)
Closesscylladb/scylladb#21890
Topology request table may change between the code reading it and
calling to cv::when() since reading is a preemption point. In this
case cv:signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading the state may have change so re-check it again
instead of sleeping.
Closesscylladb/scylladb#21713
* github.com:scylladb/scylladb:
topology_coordinator: introduce reload_count in topology state and use it to prevent race
storage_service: use conditional_variable::when in co-routines consistently
(cherry picked from commit 8f858325b6)
Closesscylladb/scylladb#21803
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.
(cherry picked from commit 0f2101b055)
It saves us permit operations on the hot path when we hit in cache.
Also, it will lay the ground for marking the permit as awaiting later.
(cherry picked from commit 868f5b59c4)
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.
Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
the step to always execute, regardless of `github.event_name`
Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.
Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements
Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.
Example Error:
```
python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
shell: /usr/bin/bash -e {0}
env:
DEFAULT_BRANCH: master
GITHUB_TOKEN: ***
Traceback (most recent call last):
File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
main()
File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
commits = repo.compare(start_commit, end_commit).commits
File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
headers, data = self._requester.requestJsonAndCheck(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
return self.__check(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```
Fixesscylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21809
(cherry picked from commit e04aca7efe)
Closesscylladb/scylladb#21820
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.
Fixes#20030
This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.
Closesscylladb/scylladb#21582
* github.com:scylladb/scylladb:
compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
compaction_group: rename `update_main_sstable_list_on_compaction_completion`
compaction_group: update maintenance sstable set on scrub compaction completion
compaction_group: store table::sstable_list_builder::result in replacement_desc
table::sstable_list_builder: remove old sstables only from current list
table::sstable_list_builder: return removed sstables from build_new_list
(cherry picked from commit 58baeac0ad)
Closesscylladb/scylladb#21790
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):
forward_jump_clocks(std::chrono::seconds(60*60*24*31));
The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.
When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:
ttl_opt topology_request_tracking_mutation_builder::ttl() const {
return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
- std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
}
_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.
The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.
The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().
Closes scylladb/scylladb#21558
(cherry picked from commit 1d0c6aa26f)
Closesscylladb/scylladb#21584Fixes#21581
Task status information from nodetool commands is not retained permanently:
- Status of completed tasks is only kept for `task_ttl_in_seconds`
- Status is removed after being queried, making it a one-time operation
This behavior is important for users to understand since subsequent
queries for the same completed task will not return any information.
Add documentation to make this clear to users.
Fixesscylladb/scylladb#21757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21386
(cherry picked from commit afeff0a792)
Closesscylladb/scylladb#21759
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.
Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency
Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions
Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios
Fixesscylladb/scylladb#21724
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21726
(cherry picked from commit 65949ce607)
Closesscylladb/scylladb#21734
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.
The scenario corresponding to the summary above looked like this:
1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
table has a row, so it generates a view update for it because
it doesn't notice that we already have data in the view.
We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.
Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.
Fixes https://github.com/scylladb/scylladb/issues/20889
Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.
Closesscylladb/scylladb#21632
* github.com:scylladb/scylladb:
test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime
(cherry picked from commit 733a4f94c7)
Closesscylladb/scylladb#21640
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.
Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.
Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.
Fixes: #21503.
Closesscylladb/scylladb#21504
(cherry picked from commit 572b005774)
Closesscylladb/scylladb#21622
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.
This PR increases coverage for existing code. Hence we need to backport it. Since only 6.2 version has zero
token node support, hence we only backport it to 6.2
Fixes: scylladb/scylladb#21096Closesscylladb/scylladb#21609
(cherry picked from commit acd643bd75)
Closesscylladb/scylladb#21764
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.
A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.
Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.
Fixes: #21612.
Needs backport to 6.1 and 6.2 as they contain the bug.
Closesscylladb/scylladb#21616
* github.com:scylladb/scylladb:
test: add test to check if repair is properly aborted
repair: add shard param to task_manager_module::is_aborted
repair: use task abort source to abort repair
repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
repair: fix task_manager_module::abort_all_repairs
(cherry picked from commit 5ccbd500e0)
Closesscylladb/scylladb#21642
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.
Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.
However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().
This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.
Fixes#21538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21540
(cherry picked from commit 7607f5e33e)
Closesscylladb/scylladb#21634
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.
Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.
Fixes: scylladb/scylladb#19959Closesscylladb/scylladb#21477
(cherry picked from commit 92db2eca0b)
Closesscylladb/scylladb#21556
After merged 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.
Fixes https://github.com/scylladb/scylladb/issues/21441
Also, since Amazon Linux 2 has different package name for semange, we need to
adjust package name.
Fixes https://github.com/scylladb/scylladb/issues/21351Closesscylladb/scylladb#21474
* github.com:scylladb/scylladb:
scylla_raid_setup: support installing semanage on Amazon Linux 2
scylla_raid_setup: fix failure on SELinux package installation
(cherry picked from commit 1c212df62d)
Closesscylladb/scylladb#21547
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.
stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.
Fixes scylladb/scylladb#21159
* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body
(cherry picked from commit e942c074f2)
(cherry picked from commit d8500472b3)
(cherry picked from commit c08ba8af68)
(cherry picked from commit a7a55298ea)
(cherry picked from commit 6cce67bec8)
Refs #21299Closesscylladb/scylladb#21434
* github.com:scylladb/scylladb:
compaction_manager: stop: await _stop_future if engaged
compaction_manager: really_do_stop: assert that no tasks are left behind
compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
compaction/compaction_manager: stop_tasks(): unlink stopped tasks
compaction/compaction_manager: make _tasks an intrusive list
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental. What matters is that `_stop_future` is engaged.
While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6cce67bec8)
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7a55298ea)
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.
Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.
Fixesscylladb/scylladb#21298
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c08ba8af68)
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).
(cherry picked from commit d8500472b3)
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.
Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
(cherry picked from commit e942c074f2)
In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependant on group0_service.
Because of this, group0_service should be initialized/destroyed before/after view_builder.
This patch also adds error injection to `raft_server_with_timeouts::read_barrier`, which does 1s sleep before doing the read barrier. There is a new test which reproduces the use after free bug using the error injection.
Fixesscylladb/scylladb#20772scylladb/scylladb#19745 is present in 6.2, so this fix should be backported to it.
Closesscylladb/scylladb#21471
* github.com:scylladb/scylladb:
test/boost/secondary_index_test: add test for use after free
api/raft: use `get_server_with_timeouts().read_barrier()` in coroutines
main,cql_test_env: start group0_service before view_builder
(cherry picked from commit 7021efd6b0)
Closesscylladb/scylladb#21506
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.
For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.
This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.
Fixesscylladb/scylladb#13757Closesscylladb/scylladb#21459
(cherry picked from commit 35921eb67e)
Closesscylladb/scylladb#21497
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).
When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:
- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once if Auther is not a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted
Fixes: https://github.com/scylladb/scylladb/issues/18973Closesscylladb/scylladb#21401
(cherry picked from commit 77604b4ac7)
Closesscylladb/scylladb#21466
Adding an auto-backport.py script to handle backport automation instead of Mergify.
The rules of backport are as follows:
* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be open in the original autoor fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixing cherry-pick the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)
Fixes: https://github.com/scylladb/scylladb/issues/18973
(cherry picked from commit f9e171c7af)
Closesscylladb/scylladb#21469
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes#20626.
Closesscylladb/scylladb#20737
* github.com:scylladb/scylladb:
tablet: Fix single-sstable split when attaching new unsplit sstables
replica: Fix tablet split execute after restart
(cherry picked from commit bca8258150)
Ref scylladb/scylladb#21415
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes https://github.com/scylladb/scylladb/issues/20044.
Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.
(cherry picked from commit bcd358595f)
(cherry picked from commit 93815e0649)
Refs https://github.com/scylladb/scylladb/pull/20939Closesscylladb/scylladb#21206
* github.com:scylladb/scylladb:
replica: Fix tombstone GC during tablet split preparation
service: Improve error handling for split
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes#20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.
Fixes#20890.
(cherry picked from commit bcd358595f)
Fix how regular tasks that have a virtual parent are created
in task_manager::module::make_task: set sequence number
of a task and subscribe to module's abort source.
Fixes: #21278.
Needs backport to 6.2
(cherry picked from commit 1eb47b0bbf)
(cherry picked from commit 910a6fc032)
Refs #21280Closesscylladb/scylladb#21332
* github.com:scylladb/scylladb:
tasks: fix sequence number assignment
tasks: fix abort source subscription of virtual task's child
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.
Drop the error handling check, successful repair is a sufficient
passing condition.
Fixes: #21167.
(cherry picked from commit 85d9565158)
Closesscylladb/scylladb#21330
The skipped ranges should be multiplied by the number of tables
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
(cherry picked from commit 1392a6068d)
(cherry picked from commit 9868ccbac0)
Refs #21252Closesscylladb/scylladb#21313
* github.com:scylladb/scylladb:
test: Add test_node_ops_metrics.py
repair: Make the ranges more consistent in the log
repair: Fix finished ranges metrics for removenode
Despite OSS doesn't limit number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.
Fixesscylladb/scylladb#21044
(cherry picked from commit 846d94134f)
Closesscylladb/scylladb#21282
Fixes#21159
When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.
Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.
(cherry picked from commit c28a5173d9)
Closesscylladb/scylladb#21393
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.
This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.
Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.
Fixesscylladb/scylladb#20060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 1c55747637)
Closesscylladb/scylladb#21404
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.
Fix it by taking a copy instead. This is a slow path anyway.
Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)
Closesscylladb/scylladb#21361
In the current scenario, the nodetool status doesn’t display information regarding zero token nodes. For example, if 5 nodes are spun by the administrator, out of which, 2 nodes are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id ” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.
Fixes: scylladb/scylladb#19849Fixes: scylladb/scylladb#17857
(cherry picked from commit 72f3c95a63)
(cherry picked from commit 39dfd2d7ac)
(cherry picked from commit c00d40b239)
Refs scylladb/scylladb#20909Closesscylladb/scylladb#21334
* github.com:scylladb/scylladb:
fix nodetool status to show zero-token nodes
test: move `wait_for_first_completed` to pylib/util.py
token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id
” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
Robust topology tests are added, which spins up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.
The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions dont support zero token nodes.
Fixes: scylladb/scylladb#19849
(cherry picked from commit c00d40b239)
Rename host_id map getter, 'get_endpoint_to_host_id_map_for_reading' to 'get_endpoint_to_host_id_map_'
Also modify the getter to return information regarding joining nodes as well.
This getter will later be used for retrieving the nodes in nodetool status, hence it needs to show all nodes,
including joining ones.
The function name suffix `_for_reading` suggests that the function was used
in some other places in the past, and indeed if we need endpoints
"for reading" then we cannot show joining endpoints. But it was confirmed
that this function is currently only used by "/storage_service/host_id" endpoint,
hence it can be modified as required.
Fixes: scylladb/scylladb#17857
(cherry picked from commit 72f3c95a63)
Currently, if a regular task does not have a parent or its parent
is a virtual tasks then it subscribes to module's abort source
in task_manager::task::impl constructor. However, at this point
the kind of the task's parent isn't set. Due to that, children
of virtual tasks aren't aborted on shutdown.
Subscribe to module's abort source in task::impl::set_virtual_parent.
(cherry picked from commit 1eb47b0bbf)
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.
By default, the monitoring stack queries it once every 60s.
(cherry picked from commit 93777fa907)
Closesscylladb/scylladb#21304
Consider the number of tables for the number of ranges logging. Make it
more consistent with the log when the ops starts.
(cherry picked from commit 1392a6068d)
The skipped ranges should be multiplied by the number of tables.
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.
Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.
Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264
(cherry picked from commit dbb26da2aa)
Closesscylladb/scylladb#21303
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.
Fixes: https://github.com/scylladb/scylladb/issues/21102
Should be backported to every 6.x branch, as it may lead to a crash.
(cherry picked from commit de511f56ac)
(cherry picked from commit 3f4c8a30e3)
(cherry picked from commit 522bede8ec)
Refs https://github.com/scylladb/scylladb/pull/21121Closesscylladb/scylladb#21256
* github.com:scylladb/scylladb:
test: topology: add disable_schema_agreement_wait utility function
test: add UT to test retrying ALTER tablets KEYSPACE
cql/tablets: fix indentation in `rf_change` event handler
cql/tablets: fix retrying ALTER tablets KEYSPACE
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.
After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.
Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
Fixes: https://github.com/scylladb/scylladb/issues/21263
This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2
(cherry picked from commit e1d8cddd09)
Refs scylladb/scylladb#21058Closesscylladb/scylladb#21178
* github.com:scylladb/scylladb:
test/boost/mutation_test: add test for multishard permit safety
test/lib/reader_lifecycle_policy: add semaphore factory to constructor
test/lib/reader_lifecycle_policy: rename factory_function
repair/row_level: drop now unneeded release_base_resource() calls
readers/multishard: make multishard reader safe to create with admitted permits
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after peforming the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.
However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.
In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.
Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.
Fixes: scylladb/scylladb#21060
(cherry picked from commit a380a2efd9)
Closesscylladb/scylladb#21129
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.
The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.
Fixes#21175.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit dfc217f99a)
Closesscylladb/scylladb#21190
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.
(cherry picked from commit e1d8cddd09)
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
(cherry picked from commit 5a3fd69374)
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.
(cherry picked from commit c8598e21e8)
Passing an admitted permit -- i.e. one with count resources on it -- to
the multishard reader, will possibly result in a deadlock, because the
permit of the multishard reader is destroyed after the permits of its
child readers. Therefore its semaphore resources won't be automatically
released until children acquire their own resources.
This creates a dependency (an edge in the "resource allocation graph"),
where the semaphore used by the multishard reader depends on the
semaphores used by children. When such dependencies create a cycle, and
permits are acquired by different reads in just the right order, a
deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of
course they aren't. This is small wonder, considering that not even the
documentation on the multishard reader mentions this problem.
To work around this, the user has to call
`reader_permit::release_base_resources()` on the permit, before passing
it to the multishard reader.
On multiple occasions, developers (including the very author of the
multishard reader), forgot or didn't know about this and this resulted
in deadlocks down the line.
This is a design-flaw of the multishard reader, which is addressed in
this patch, after which, it is safe to pass admitted or not admitted
permits to the multishard reader, it will handle the call to
`release_base_resources()` if needed.
(cherry picked from commit 218ea449a5)
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closesscylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
(cherry picked from commit 519e167611)
Backported from #20985 to 6.2.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#21251
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
(cherry picked from commit 522bede8ec)
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.
Fixes: scylladb/scylladb#21102
(cherry picked from commit de511f56ac)
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
Fixes#20043
(cherry picked from commit f09fe4f351)
(cherry picked from commit e5bf376cbc)
(cherry picked from commit 1863ccd900)
Refs #21020Closesscylladb/scylladb#21111
* github.com:scylladb/scylladb:
tablets: Validate system.tablets update
group0_client: Introduce change validation
group0_client: Add shared_token_metadata dependency
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.
If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#19325
(cherry picked from commit 56c971373c)
(cherry picked from commit 0ac450de05)
Refs #20528Closesscylladb/scylladb#21211
* github.com:scylladb/scylladb:
scylla_raid_setup: configure SELinux file context
scylla_coredump_setup: fix SELinux configuration for RHEL9
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#20573
(cherry picked from commit 0ac450de05)
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.
Fixes#19325
(cherry picked from commit 56c971373c)
deselect remove_data_dir_of_dead_node event from test_random_failures
due to ussue #20751
(cherry picked from commit 9b0e15678e)
Closesscylladb/scylladb#21138
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.
Fixes: #21034
Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.
(cherry picked from commit f8475915fb)
Closesscylladb/scylladb#21107
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.
If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.
However, the check assumes that the RELEASE is in the format
<build identifier>.<date>.<commit hash>
and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.
Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.
Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.
Fixes: scylladb/scylladb#21027
(cherry picked from commit 64ca58125e)
Closesscylladb/scylladb#21103
This is a manual backport of #20788
When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:
1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1
If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.
This change fixes this problem by disabling tombstone garbage collection for pending replicas.
This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.
Fixes#21090Closesscylladb/scylladb#21061
* github.com:scylladb/scylladb:
test: test tombstone GC disabled on pending replica
tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
database::table: add tombstone_gc_enabled(locator::tablet_id)
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```
in this change, we redistribute `addr2line.py` as well. this
should address the issue above.
Fixesscylladb/scylladb#21077
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)
Closesscylladb/scylladb#21085
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in filter_for_query(): the map is considered incorrect if the list of replicas contains a node from a data center whose replication factor is 0.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282; it only adds condition checks to prevent undefined behavior in cases of inconsistent inputs.
Refs scylladb/scylladb#20625
As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.
(cherry picked from commit 132358dc92)
(cherry picked from commit ae23d42889)
(cherry picked from commit ad93cf5753)
(cherry picked from commit 8db6d6bd57)
(cherry picked from commit c373edab2d)
Refs #20851Closesscylladb/scylladb#21067
* github.com:scylladb/scylladb:
Add conditions checking for get_read_executor
Avoid an extra call to block_for in db::filter_for_query.
Improve code readability in consistency_level.cc and storage_proxy.cc
tools: Add build_info header with functions providing build type information
tests: Add tests for alter table with RF=1 to RF=0
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.
The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.
However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.
Refs scylladb/scylladb#17575
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)
Closesscylladb/scylladb#20722
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.
This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.
(cherry picked from commit 6cf3747c5f)
Closesscylladb/scylladb#21030
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
get_endpoints_for_reading(): the map is considered incorrect the number of
read replica nodes is higher than replication factor. The check is
applied only when built in non release mode.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.
Refs scylladb/scylladb#20625
(cherry picked from commit c373edab2d)
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.
(cherry picked from commit ae23d42889)
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.
Test for scylladb/scylladb#20625
(cherry picked from commit 132358dc92)
In order to avoid cases during tablet migrations where we garbage
collect tombstones before the data it shadows arrives, we will
disable tombstone GC on pending replicas.
To achieve this we added a tombston_gc_enabled flag to compaction_group.
This flag is updated from updte_effective_repliction_map method of the
tablet_storage_group_manager class.
It was not possible to link to configuration parameters groups in docs/reference/configuration-parameters.rst if they contained a space.
(cherry picked from commit 2247bdbc8c)
Closesscylladb/scylladb#21037
This change adds the flag tombstone_gc_enabled to compaction_group.
The value of this flag will be set in
tablet_storage_group_manager::update_effective_replication_map().
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.
Fixes: #20240
(cherry picked from commit e0c1a51642)
Closesscylladb/scylladb#21022
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.
Fixes: #20039
Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.
(cherry picked from commit 042825247f)
(cherry picked from commit adf453af3f)
(cherry picked from commit 9c5950533f)
(cherry picked from commit 47acdc1f98)
(cherry picked from commit 93d61d7031)
(cherry picked from commit 6676e47371)
(cherry picked from commit 2aabe7f09c)
(cherry picked from commit ee56bbfe61)
Refs #20208Closesscylladb/scylladb#21009
* github.com:scylladb/scylladb:
cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
cql: join new and old KS options in ALTER tablets KS
cql: fix validation of ALTERing RFs in tablets KS
cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
cql: validate RF change for new DCs in ALTER tablets KS
cql: extend test_alter_tablet_keyspace_rf
cql: refactor test_tablets::test_alter_tablet_keyspace
cql: remove unused helper function from test_tablets
- As part of deprecation of IP address usage, warning messages were added when IP addresses specified in the `ignore-dead-nodes` and `--ignore-dead-nodes-for-replace` options for scylla and nodetool.
- Slight optimizations for `utils::split_comma_separated_list`, ` host_id_or_endpoint lists` and `storage_service` remove node operations, replacing `std::list` usage with `std::vector`.
Fixes scylladb/scylladb#19218
Backport: 6.2 as it's not yet released.
(cherry picked from commit 3b9033423d)
(cherry picked from commit a871321ecf)
(cherry picked from commit 9c692438e9)
(cherry picked from commit 6398b7548c)
Refs #20756Closesscylladb/scylladb#20958
* github.com:scylladb/scylladb:
config: Add a warning about use of IP address for join topology and replace operations.
nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
tests: Fix incorrect UUIDs in test_nodeops
utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.
The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.
Refs: #18269
(cherry picked from commit 3ebb124eb2)
Closesscylladb/scylladb#20955
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.
During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.
In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.
Fixes#20699.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit cf58674029)
Closesscylladb/scylladb#20899
Refs #20686
Refs #15607
In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.
(cherry picked from commit 60f8a9f39d)
Closesscylladb/scylladb#20705
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.
The patch makes sure that there is no end iterator in _live_iterators.
Fixesscylladb/scylladb#20874
(cherry picked from commit da084d6441)
Closesscylladb/scylladb#21003
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.
since we decided to use Boost.Regex, let's catch `boost::regex_error`.
Refs a3db5401Fixesscylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 439c52c7c5)
Closesscylladb/scylladb#20952
Commit aa1270a00c changed most uses
of `assert` in the codebase to `SCYLLA_ASSERT`.
But the comment fixed in this patch is talking specifically about
`assert`, and shouldn't have been changed. It doesn't make sense
after the change.
(cherry picked from commit da7edc3a08)
Closesscylladb/scylladb#20976
Group0 server is often used in asynchronous context, but we do not wait
for them to complete before destroying the server. We already have
shutdown gate for it, so lets use it in those asynch functions.
Also make sure to signal group0 abort source if initialization fails.
Fixes scylladb/scylladb#20701
Backport to 6.2 since it contains af83c5e53e and it made the race easier to hit, so tests became flaky.
(cherry picked from commit ba22493a69)
(cherry picked from commit e642f0a86d)
Refs #20891Closesscylladb/scylladb#21008
* github.com:scylladb/scylladb:
group: hold group0 shutdown gate during async operations
group0: Stop group0 if node initialization fails
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.
(cherry picked from commit ee56bbfe61)
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.
(cherry picked from commit 2aabe7f09c)
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
needs their RFs to be validated.
(cherry picked from commit 6676e47371)
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.
(cherry picked from commit 93d61d7031)
Wait for all outstanding async work that uses group0 to complete before
destroying group0 server.
Fixesscylladb/scylladb#20701
(cherry picked from commit e642f0a86d)
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.
Refs: #20039
(cherry picked from commit 47acdc1f98)
Commit af83c5e53e moved aborting of group0 into the storage service
drain function. But it is not called if node fails during initialization
(if it failed to join cluster for instance). So lets abort on both
paths (but only once).
(cherry picked from commit ba22493a69)
1. Renamed the testcase to emphasize that it only focuses on testing
changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8
(cherry picked from commit adf453af3f)
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.
(cherry picked from commit 042825247f)
operations.
When the '--ignore-dead-nodes-for-replace' config option contains
IP addresses, a warning will be logged, notifying the user that
using IP addresses with this option is deprecated and will no
longer be supported in the next release.
Fixesscylladb/scylladb#19218
(cherry picked from commit 6398b7548c)
Since we are deprecating the use of IP addresses, a warning message will be printed
if 'nodetool removenode --ignore-dead-nodes' is used with IP addresses.
(cherry picked from commit 9c692438e9)
It was found that the UUIDs used in test_nodeops were
invalid. This update replaces those UUIDs with newly generated
random UUIDs.
(cherry picked from commit a871321ecf)
- utils::split_comma_separated_list now accepts a reference to sstring instead
of a copy to avoid extra memory allocations. Additionally, the results of
trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
parse_node_list and api/storage_service::set_storage_service were changed to use
std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
std::vector is a more cache-friendly structure, resulting in better performance.
(cherry picked from commit 3b9033423d)
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:
if (strategy.supports_tablets) {
if (cql.with_tablets) {
if (cfg.enable_tablets) {
return create_keyspace_with_tablets();
} else {
throw "tablets are not enabled";
}
} else if (cql.with_tablets = off) {
return create_keyspace_without_tablets();
} else { // cql.with_tablets is not specified
if (cfg.enable_tablets) {
return create_keyspace_with_tablets();
} else {
return create_keyspace_without_tablets();
}
}
} else { // strategy doesn't support tablets
if (cql.with_tablets == on) {
throw "invalid cql parameter";
} else if (cql.with_tablets == off) {
return create_keyspace_without_tablets();
} else { // cql.with_tablets is not specified
return create_keyspace_without_tablets();
}
}
closes: #20088
In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.
There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ebedc57300)
Closesscylladb/scylladb#20927
Fixes#20862
With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.
Test included.
(cherry picked from commit ee5e71172f)
Closesscylladb/scylladb#20902
Currently, node ops virtual task gathers its children from all nodes contained
in a sum of service::topology::normal_nodes and service::topology::transition_nodes.
The maps may contain nodes that are down but weren't removed yet. So, if a user
requests the status of a node ops virtual task, the task's attempt to retrieve
its children list may fail with seastar::rpc::closed_error.
Filter out the tasks that are down in node_ops::task_manager_module::get_nodes.
Fixes: #20843.
(cherry picked from commit a558abeba3)
Closesscylladb/scylladb#20898
This fixes a use-after-free bug when parsing clustering key across
pages.
Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).
Details of the first problem:
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.
A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
Details of the solution:
We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.
We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.
Fixes#20766
(cherry picked from commit 8aca93b3ec)
(cherry picked from commit ac823b1050)
(cherry picked from commit 93bfaf4282)
(cherry picked from commit c0fa49bab5)
(cherry picked from commit 29498a97ae)
(cherry picked from commit c15145b71d)
(cherry picked from commit 7670ee701a)
(cherry picked from commit c09fa0cb98)
(cherry picked from commit 0279ac5faa)
(cherry picked from commit 8e54ecd38e)
(cherry picked from commit b5ae7da9d2)
Refs #20837Closesscylladb/scylladb#20905
* github.com:scylladb/scylladb:
sstables: bsearch_clustered_cursor: Add trace-level logging
sstables: bsearch_clustered_cursor: Move definitions out of line
test, sstables: Verify parsing stability when allocating section is retried
test, sstables: Verify parsing stability when buffers cross page boundary
sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
cached_file: Adapt page_view to ContiguousSharedBuffer
cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
sstables, utils: Allow parsers to work with different buffer types
sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
In order to later use the formatter for the inner class
promoted_index_block, which is defined out of line after
cached_promoted_index class definition.
(cherry picked from commit 8e54ecd38e)
This fixes a use-after-free bug when parsing clustering key across
pages.
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
section. The page_view maintains its hold to the LSA buffer even
across allocating sections.
Fixes#20766
(cherry picked from commit 7670ee701a)
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.
In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.
It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
(cherry picked from commit c0fa49bab5)
When reset() is done due to allocating section retry, it can be
theoretically in an arbitrary point. So we should not assume that it
finished parsing and state was reset by previous parsing. We should
reset all the fields.
(cherry picked from commit 93bfaf4282)
Parser's state was not reset when allocating section was retried.
This doesn't cause problems in practice, because reserves are enough
to cover allocation demands of parsing clustering keys, which are at
most 64K in size. But it's still potentially unsafe and needs fixing.
(cherry picked from commit 8aca93b3ec)
For each new node added to the raft config populate it's ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper.
Fixes scylladb/scylladb#20600
Backport to all supported versions since the bug may cause bootstrapping failure.
(cherry picked from commit bddaf498df)
(cherry picked from commit 9e4cd32096)
Refs #20601Closesscylladb/scylladb#20847
* github.com:scylladb/scylladb:
test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
group0: make sure that address map has an entry for each new node in the raft configuration
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.
(cherry picked from commit bddaf498df)
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
Fixes: https://github.com/scylladb/scylladb/issues/20629
Need to be backported since this is a regression
(cherry picked from commit 644e7a2012)
(cherry picked from commit c0939d86f9)
(cherry picked from commit 1b4c255ffd)
Refs #20743Closesscylladb/scylladb#20829
* github.com:scylladb/scylladb:
test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
topology coordinator:: mark node as being replaced earlier
topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
What it called "leader" is actually the destination of the RPC.
Trivial fix, should be backported to all affected versions.
(cherry picked from commit 09c68c0731)
Closesscylladb/scylladb#20826
This commit modifies the Features page in the following way:
- It adds a short introduction and descriptions to each listed feature.
- It hides the ToC (required to control and modify the information on the page,
e.g., to add descriptions, have full control over what is displayed, etc.)
- Removes the info about Enterprise features (following the request not to include
Enterprise info in the OSS docs)
Fixes https://github.com/scylladb/scylladb/issues/20617
Blocks https://github.com/scylladb/scylla-enterprise/pull/4711
(cherry picked from commit da8047a834)
Closesscylladb/scylladb#20811
This PR addresses multiple issues with alternator batch metrics:
1. Rename the metrics to scylla_alternator_batch_item_count with op=BatchGetItem/BatchWriteItem
2. The batch size calculation was wrong and didn't count all items in the batch.
3. Add a test to validate that the metrics values increase by the correct value (not just increase). This also requires an addition to the testing to validate ops of different metrics and an exact value change.
Needs backporting to allow the monitoring to use the correct metrics names.
Fixes#20571
(cherry picked from commit 515857a4a9)
(cherry picked from commit 905408f764)
(cherry picked from commit 4d57a43815)
(cherry picked from commit 8dec292698)
Refs #20646Closesscylladb/scylladb#20758
* github.com:scylladb/scylladb:
alternator:test_metrics test metrics for batch item count
alternator:test_metrics Add validating the increased value
alternator: Fix item counting in batch operations
Alterntor rename batch item count metrics
This commit fixes a link to the Manager by adding a missing underscore
to the external link.
(cherry picked from commit aa0c95c95c)
Closesscylladb/scylladb#20710
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
(cherry picked from commit c0939d86f9)
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.
(cherry picked from commit 644e7a2012)
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.
Fixesscylladb/scylladb#20791
Should be backported to all branches containing this test to reduce
flakiness.
(cherry picked from commit f390d4020a)
Closesscylladb/scylladb#20807
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.
This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd)
Closesscylladb/scylladb#20799
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.
Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.
Fixesscylladb/scylladb#20608
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20609
(cherry picked from commit 39ce358d82)
Refs: scylladb/scylladb#20609
This patch adds tests for the batch operations item count.
The tests validate that the metrics tracking the number of items
processed in a batch increase by the correct amount.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 8dec292698)
The `check_increases_operation` now allows override the checked metric.
Additionally, a custom validation value can now be passed, which make it
possible to validate the amount by which a value has changed, rather
than just validating that the value increased.
The default behavior of validating that values have increased remains
unchanged, ensuring backward compatibility.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 4d57a43815)
This patch fixes the logic for counting items in batch operations.
Previously, the item count in requests was inaccurate, it count the
number of tabels in get_item and the request_items in write_items.
The new logic correctly counts each individual item in `BatchGetItem`
and `BatchWriteItem` requests.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 905408f764)
This patch renames metrics tracking the total number of items in a batch
to `scylla_alternator_batch_item_count`. It uses the existing `op` label to
differentiate between `BatchGetItem` and `BatchWriteItem` operations.
Ensures better clarity and distinction for batch operations in monitoring.
This an example of how it looks like:
# HELP scylla_alternator_batch_item_count The total number of items processed across all batches
# TYPE scylla_alternator_batch_item_count counter
scylla_alternator_batch_item_count{op="BatchGetItem",shard="0"} 4
scylla_alternator_batch_item_count{op="BatchWriteItem",shard="0"} 4
(cherry picked from commit 515857a4a9)
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant $maintenance, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since unknown isolation cookie just uses default scheduling group. However, in enterprise that leads to creating a service level on not-upgraded nodes, which may end up in an error if user create maximum number of service levels.
This patch adds a cluster feature to guard adding the new tenant. It's done in the way to handle two upgrade scenarios:
version without $maintenance tenant -> version with $maintenance tenant guarded by a feature
version with $maintenance tenant but not guarded by a feature -> version with $maintenance tenant guarded by a feature
The PR adds enabled flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The $maintenance tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.
Fixes https://github.com/scylladb/scylladb/issues/20070
Refs https://github.com/scylladb/scylla-enterprise/issues/4403
(cherry picked from commit d44844241d)
(cherry picked from commit 71a03ef6b0)
(cherry picked from commit b4b91ca364)
Refs https://github.com/scylladb/scylladb/pull/19802Closesscylladb/scylladb#20690
* github.com:scylladb/scylladb:
message/messaging_service: guard adding maintenance tenant under cluster feature
message/messaging_service: add feature_service dependency
message/messaging_service: add `enabled` flag to statement tenants
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.
This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.
(cherry-picked from d44844241d)
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).
The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups. Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.
Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.
Fixesscylladb/scylladb#18862
Needs backport to 6.0 since tablets require this capability
Closesscylladb/scylladb#19555
* github.com:scylladb/scylladb:
sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
sstables: storage: keep base directory in base class
sstables: storage: define opened_directory in header file
sstable_directory: use only dirlog
The cleanup compaction task is a maintenance operation that runs after
topology changes. So, run it under the maintenance scheduling group to
avoid interference with regular compaction tasks. Also remove the share
allocations done by the cleanup task, as they are unnecessary when
running under the maintenance group.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20582
Since 3c7af28725, the cqlsh submodule no longer contains a
bin/cqlsh shell script. This broke the supermodule's bin/cqlsh
shortcut.
Fix it by invoking cqlsh.py directly.
Closesscylladb/scylladb#20591
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.
Ignore cleanup of deallocated storage groups.
Fixes: #19752.
Needs to be backported to all branches with tablets (6.0 and later)
Closesscylladb/scylladb#20584
* github.com:scylladb/scylladb:
test: check if cleanup of deallocated sg is ignored
replica: ignore cleanup of deallocated storage group
To drop a semaphore it should not be held by anyone, so we need to
release out units before checking if a semaphore can be dropped.
Fixes: scylladb/scylladb#20602Closesscylladb/scylladb#20607
as `_bucket` is an `unordered_map<bucket_id, timestamp_bucket_writer>`,
when writing to a given bucket, we try to create a writer with the
specified bucket id, so the returned iterator should point to a node
whose `first` element is always the bucket id.
so, there is no need to reference `it` for the bucket id, let's just
reference the parameter. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20598
If a regular row isn't present, no regular column restriction
(say, r=3) can pass since all regular columns are presented as NULL,
and we don't have an IS NULL predicate. Yet we just ignore it.
Handle the restriction on a missing column by return false, signifying
the row was filtered out.
We have to move the check after the conditional checking whether there's
any restriction at all, otherwise we exit early with a false failure.
Unit test marked xfail on this issue are now unmarked.
A subtest of test_tombstone_limit is adjusted since it depended on this
bug. It tested a regular column which wasn't there, and this bug caused
the filter to be ignored. Change to test a static column that is there.
A test for a bug found while developing the patch is also added. It is
also tested by test_tombstone_limit, but better to have a dedicated test.
Fixes#10357Closesscylladb/scylladb#20486
This option was silently broken when --enable-tablet's default changed
from false to true. The reason is that when --vnodes is passed, run only
removes --enable-tablets=true from scylla's command line. With the new
default this is not enough, we need to explicitely disable tablets to
override the default.
Closesscylladb/scylladb#20462
This is a translation of Cassandra's CQL unit test source file
OperationFctsTest.java into our cql-pytest framework.
This is a massive test suite (over 800 lines of code) for Cassandra's
"arithmetic operators" CQL feature (CASSANDRA-11935), which was added
to Cassandra almost 8 years ago (and reached Cassandra 4.0), but we
never implemented it in Scylla.
All of the tests in suite fail in ScyllaDB due to our lack of this
feature:
Refs #2693: Support arithmetic operators
One test also discovered a new issue:
Refs #20501: timestamp column doesn't allow "UTC" in string format
All the tests pass on Cassandra.
Some of the tests insist on specific error message strings and specific
precision for decimal arithmetic operations - where we may not necessarily
want to be 100% compatible with Cassandra in our eventual implementation.
But at least the test will allow us to make deliberate - and not accidental -
deviations from compatibility with Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20502
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.
Ignore cleanup of deallocated storage group.
Currently, test.py will throw an error if the test will use
BOOST_DATA_TEST_CASE. test.py as a first step getting all test
functions in the file, but when BOOST_DATA_TEST_CASE will be used the
output will have additional lines indicating parametrized test that
test.py can not handle. This commit adds handling this case, as a caveat
all tests should start from 'test' or they will be ignored.
Closes: #20530Closesscylladb/scylladb#20556
This function was obsoleted by schema_builder some time ago. Not to patch all its callers, that helper became wrapper around it. Remained users are all in tests, and patching the to use builder directory makes the code shorter in many cases.
Closesscylladb/scylladb#20466
* github.com:scylladb/scylladb:
schema: Ditch make_shared_schema() helper
test: Tune up indentation in uncompressed_schema()
test: Make tests use schema_builder instead of make_shared_schema
Since #14152 creation of an sstable takes table dir and its state. The
test in question wants to create and sstable in upload/ subdir and for
that it used to maintain full "cf.dir/upload" path, which is not
required any more.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20514
There are two currently -- upload_sink_base and do_upload_file. This PR merges as much code as possible (spoiler: it's already mostly copy-n-pase-d, so squashing is pretty straightforward)
Closesscylladb/scylladb#20568
* github.com:scylladb/scylladb:
s3/client: Reuse class multipart_upload in do_upload_file
s3/client: Split upload_sink_base class into two
the `seastar/core/print.hh` header is no longer required by
`auth/resource.hh`. this was identified by clang-include-cleaner.
As the code is audited, wecan safely remove the #include directive.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20575
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.
For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.
If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.
In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.
If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.
Fixesscylladb/scylladb#20423Fixesscylladb/scylladb#20424
> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads
Closesscylladb/scylladb#20446
* github.com:scylladb/scylladb:
cql-pytest: add test_compaction_tombstone_gc
sstable_compaction_test: add mv_tombstone_purge_test
sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
sstable_compaction_test: tombstone_purge_test: add testlog debugging
sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
sstable, compaction: add debug logging for extended min timestamp stats
compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
compaction: define max_purgeable_fn
tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
sstables: scylla_metadata: add ext_timestamp_stats
compaction_group, storage_group, table_state: add extended timestamp stats getters
sstables, memtable: track live timestamps
memtable_encoding_stats_collector: update row_marker: do nothing if missing
Since JMX server is deprecated, drop them from submodule, build system
and package definition.
Related scylladb/scylla-tools-java#370
Related #14856
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closesscylladb/scylladb#17969
Uploading a file is implemented by the do_upload_file class. This class
re-implements a big portion of what's currently in multipart_upload one.
This patch makes the former class inherit from the latter and removes
all the duplication from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This class implements two facilities -- multipart upload protocol itself
plus some common parts of upload_sink_impl (in fact -- only close() and
plugs put(packet)).
This patch aplits those two facilities into two classes. One of them
will be re-used later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the server_impl::applier_fiber is paused by a co_await at line raft/server.cc:1375:
```
co_await override_snapshot_thresholds();
```
a new snapshot may be applied, which updates the actual values of the log's last applied
and snapshot indexes. As a result, the new snapshot index could become higher than the
old value stored in _applied_idx at line raft/server.cc:1365, leading to an assertion
failure in log::last_conf_for().
Since error injection is disabled in release builds, this issue does not affect production releases.
This issue was introduced in the following commit
9dfa041fe1,
when error injection was added to override the log snapshot configuration parameters.
How to reproduce:
1. Build debug version of randomized_nemesis_test
```
ninja-build build/debug/test/raft/randomized_nemesis_test
```
2. Run
```
parallel --halt now,fail=1 -j20 'build/debug/test/raft/randomized_nemesis_test \
--run_test=test_frequent_snapshotting -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 \
--kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --default-log-level \
trace > tmp/logs/eraseme_{}.log 2>&1 && rm tmp/logs/eraseme_{}.log' ::: {1..1000}
```
Fixesscylladb/scylladb#20363Closesscylladb/scylladb#20555
This PR hides the ToC on the Alternator page, as we don't need it, especially at the end of the page.
The ToC must be hidden rather than removed because removing it would, in turn, remove the "Getting Started With ScyllaDB Alternator" and "ScyllaDB Alternator for DynamoDB users" from the page tree and make them inaccessible.
In addition, this PR moves Alternator higher in the page tree.
Fixes https://github.com/scylladb/scylladb/issues/19823Closesscylladb/scylladb#20565
* github.com:scylladb/scylladb:
doc: move Alternator higher in the page tree
doc: hide the redundant ToC on the Alternator page
A CreateTable request defines the KeySchema of the base table and each
of its GSIs and LSIs. It also needs to give an AttributeDefinition for
each attribute used in a KeySchema - which among other things specifies
this attribute's type (e.g., S, N, etc.). Other, non-key, attributes *do
not* have a specified type, and accordingly must not be mentioned in
AttributeDefinitions.
Before this patch, Alternator just ignored unused AttributeDefinitions
entries, whereas DynamoDB throws an error in this case. This patch fixes
Alternator's behavior to match DynamoDB's - and adds a test to verify this.
Besides being more error-path-compatible with DynamoDB, this extra check
can also help users: We already had one user complaining that an
AttributeDefinitions setting he was using was ignored, not realizing
that it wasn't used by any KeySchema. A clear error message would have
saved this user hours of investigation.
Fixes#19784.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20378
recently, we are observing errors like:
```
stderr: error running operation: rjson::error (JSON SCYLLA_ASSERT failed on condition 'false', at: 0x60d6c8e 0x4d853fd 0x50d3ac8 0x518f5cd 0x51c4a4b 0x5fad446)
```
we only passed `false` to the `RAPIDJSON_ASSERT()` macro, so what we
have is but the type of the error (rjson::error) and a backtrace.
would be better if we can have more information without recompiling
or fetching the debug symbols for decipher the backtrace.
Refs scylladb/scylladb#20533
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20539
This commit hides the ToC, as we don't need it, especially at the end of the page.
The ToC must be hidden rather than removed because removing it would, in turn,
remove the "Getting Started With ScyllaDB Alternator" and "ScyllaDB Alternator for DynamoDB users"
from the page tree and make them inaccessible.
add test for issue when writes in commitlog segments pinned to another table can be resurrected.
This test based on dtest code published in #14870 and adapted for community version.
It's a regression test for #15060 fix and should fail before this patch and succeed afterwards.
Refs #14870, #15060Closesscylladb/scylladb#20331
When we introduced optimized clang at 6e487a4, we dropped multiarch build on frozen toolchain, because building clang on QEMU emulation is too heavy.
Actually, even after the patch merged, there are two mode which does not build clang, --clang-build-mode INSTALL_FROM and --clang-build-mode SKIP.
So we should restore multiarch build only these mode, and keep skipping on INSTALL mode since it builds clang.
Since we apply multiarch on INSTALL_FROM mode, --clang-archive replaced
to --clang-archive-x86_64 and --clang-archive-aarch64.
Note that this breaks compatibility of existing clang archive, since it
changes clang root directory name from llvm-project to llvm-project-$ARCH.
Closes#20442Closesscylladb/scylladb#20444
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This PR introduces a new file data source implementation for uncompressed SSTables that will be validating the checksum of each chunk that is being read. Unlike for compressed SSTables, checksum validation for uncompressed SSTables will be active for scrub/validate reads but not for normal user reads to ensure we will not have any performance regression.
It consists of:
* A new file data source for uncompressed SSTables.
* Integration of checksums into SSTable's shareable components. The validation code loads the component on demand and manages its lifecycle with shared pointers.
* A new `integrity_check` flag to enable the new file data source for uncompressed SSTables. The flag is currently enabled only through the validation path, i.e., it does not affect normal user reads.
* New scrub tests for both compressed and uncompressed SSTables, as well as improvements in the existing ones.
* A change in JSON response of `scylla validate-checksums` to report if an uncompressed SSTable cannot be validated due to lack of checksums (no `CRC.db` in `TOC.txt`).
Refs #19058.
New feature, no backport is needed.
Closesscylladb/scylladb#20207
* github.com:scylladb/scylladb:
test: Add test to validate SSTables with no checksums
tools: Fix typo in help message of scylla validate-checksums
sstables: Allow validate_checksums() to report missing checksums
test: Add test for concurrent scrub/validate operations
test: Add scrub/validate tests for uncompressed SSTables
test/lib: Add option to create uncompressed random schemas
test: Add test for scrub/validate with file-level corruption
test: Check validation errors in scrub tests
sstables: Enable checksum validation for uncompressed SSTables
sstables: Expose integrity option via crawling mutation readers
sstables: Expose integrity option via data_consume_rows()
sstables: Add option for integrity check in data streams
sstables: Remove unused variable
sstables: Add checksum in the SSTable components
sstables: Introduce checksummed file data source implementation
sstables: Replace assert with on_internal_error
Fixes#20543
In cql_test_env, if cfg_in.ms_listen is set, we try to get a free port for the current test on
which message service rpc can bind. This to allow multiple tests in parallel.
However, we just do this by using random and getting a number, not actually verifying it against
host ports in use.
This is complicated further by the fact that port reuse is effectively disabled in seastar
(see reactor::posix_reuseport_detect()). Due to this, the solution applied here is a combo
of
* Create temp socket with port = 0 to get a previously free port
* Close socket right before listen (to handle reuse not working)
* Retry on EADDRINUSE
Closesscylladb/scylladb#20547
In 880058073b a new column (request_type)
was added to topology_requests table, but the table's schema version
wasn't changed. Due to that during cluster upgrade, the old and the new
versions occur but they are not distinguishable.
Add offset to schema version of topology_requests table if it contains
request_type column.
Fixes: #20299.
Closesscylladb/scylladb#20402
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.
The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.
New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.
The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.
When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.
Fixesscylladb/scylladb#15329
dtest https://github.com/scylladb/scylla-dtest/pull/4827Closesscylladb/scylladb#19745
* github.com:scylladb/scylladb:
view: test view_build_status table with node replace
test/pylib: use view_build_status_v2 table in wait_for_view
view_builder: common write view_build_status function
view_builder: improve migration to v2 with intermediate phase
view: delete node rows from view_build_status on node removal
view: sanitize view_build_status during migration
view: make old view_build_status table a virtual table
replica: move streaming_reader_lifecycle_policy to header file
view_builder: test view_build_status_v2
storage_service: add view_build_status to raft snapshot
view_builder: migration to v2
db:system_keyspace: add view_builder_version to scylla_local
view_builder: read view status from v2 table
view_builder: introduce writing status mutations via raft
view_builder: pass group0_client and qp to view_builder
view_builder: extract sys_dist status operations to functions
db:system_keyspace: add view_build_status_v2 table
In a previous patch we extended the return status of
`sstables::validate_checksums()` to report if an SSTable cannot be
validated due to a missing CRC component (i.e., CRC.db does not appear
in TOC.txt).
Add a test case for this.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Change the return type of `sstable::validate_checksums()` from binary
(valid/invalid) to a ternary (valid/invalid/no_checksums). The third
status represents uncompressed SSTables without a CRC component (no
entry for CRC.db in the TOC).
Also, change the JSON response of `sstable validate-checksums` to expose
the new status. Replace the boolean value for valid/invalid checksums
with an object that contains two boolean keys: one that indicates if the
SSTable has checksums, and one that indicates if the checksums are valid
or not. The second key is optional and appears only if the SSTable has
checksums.
Finally, update the documentation to reflect the changes in the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Theoretically it is possible to launch more than one scrub instances
simultaneously. Since the checksum component is a shared resource,
accesses have to be synchronized.
Add a test that launches two scrub operations in validate mode and
ensures that the checksum component is loaded once, referenced by all
scrub instances via shared pointers, and deleted once the scrub
operations finish. Introduce an injection point to achieve concurrent
execution of scrubs.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently the unit tests check scrub in validate mode against compressed
SSTables only. Mirror the tests for uncompressed SSTables as well.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `random_schema_specification` to support creating both
compressed and uncompressed schemas.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently, we test scrub/validate only against a corrupted SSTable with
content-level corruption (out-of-order partition key).
Add a test for file-level corruption as well. This should trigger the
checksum check in the underlying compressed file data source
implementation.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Scrub was extended in PR #11074 to report validation errors but the
unit tests were not updated.
Update the tests to check the validation errors reported by scrub.
Validation errors must be zero for valid SSTables and non-zero for
invalid SSTables.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `sstable::validate()` to validate the checksums of
uncompressed SSTables. Given that this is already supported for
compressed SSTables, this allows us to provide consistent behavior
across any type of SSTable, be it either compressed or uncompressed.
The most prominent use case for this is scrub/validate, which is now
able to detect file-level corruption in uncompressed SSTables as
well.
Note that this change will not affect normal user reads which skip
checksum validation altogether.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a new boolean parameter in `sstable::data_stream()` to
enable/disable integrity mechanisms in the underlying data streams.
Currently, this only affects uncompressed SSTables and it allows to
enable/disable checksum validation on each chunk. The validation happens
transparently via the checksummed data source implementation.
The reason we need this option is to allow differentiating the behavior
between normal user reads and scrub/validate reads. We would like to
enable scrub to verify checksums for uncompressed SSTables, while
leaving normal user reads unchanged for performance reasons (read
amplification due to round up of reads to chunk size and loading of the
CRC component).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Remove unused stream variable from `sstable::data_stream()`. This was
introduced in commit 47e07b787e but never used.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Uncompressed SSTables store their checksums in a separate CRC.db file.
Add this in the list of SSTable components.
Since this component is used only for validation, load the component
on-demand for validation tasks and delete it when all validation tasks
finish. In more detail:
- Make the checksum component shareable and weakly referencable.
Also, add a constructor since it is no longer an aggregate.
- Use a weak pointer to store a non-owning reference in the components
and a shared pointer to keep the object alive while validation runs.
Once validation finishes, the component should be cleaned up
automatically.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Introduce a new data source implementation for uncompressed SSTables.
This is just a thin wrapper for a raw data source that also performs
checksum validation for each chunk. This way we can have consistent
behavior for compressed and uncompressed SSTables.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Requires a backport to 6.0 and 6.1 as they have the same issue.
Closesscylladb/scylladb#20471
* github.com:scylladb/scylladb:
cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
database::get_all_tables_flushed_at: fix return value
Test tombstone garbage collection with:
1. conflicting live data in memtable (verifying there is no regression
in this area)
2. deletion in memtable (reproducing scylladb/scylladb#20423)
3. materialized view update in memtable (reproducing scylladb/scylladb#20424)
in materialized_views
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than forging a timestamp from the gc_clock
just use `next_timestamp` do it can be considered
for tomebstone purging purposes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When purging regular tombstone consult the min_live_timestamp,
if available.
For shadowable_tombstones, consult the
min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.
If both are missing, fallback to the legacy
(and inaccurate) min_timestamp.
Fixesscylladb/scylladb#20423Fixesscylladb/scylladb#20424
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we add a new, is_shadowable, parameter to it.
And define global `can_always_purge` and `can_never_purge`
functions, a-la `always_gc` and `never_gc`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And define `never_gc` globally, same as `always_gc`
Before adding a new, is_shadowable parameter to it.
Since it is used in the context of compaction
it better fits compaction_garbage_collector header
rather than tombstone.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Store and retrieve the optional extended timestamp statistics
(min_live_timestamp and min_live_row_marker_timestamp)
in the scylla_metadata component.
Note that there is no need for a cluster feature to
store those attributes since the scylla_metadata
on-disk format is extensible so that old sstables
can be read by new versions, seeing the extra stats
is missing, and new sstables can be read by old
versions that ignore unknown scylla metadata section types.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When garbage collecting tombstones, we care only
about shadowing of live data. However, currently
we track min/max timestamp of both live and dead
data, but there is no problem with purging tombstones
that shadow dead data (expired or shdowed by other
tombstones in the sstable/memtable).
Also, for shadowable tombstones, we track live row marker timestamps
separately since, if the live row marker timestamp is greater than
a shadowable tombstone timestamp, then the row marker
would shadow the shadowable tombstone thus exposing the cells
in that row, even if their timestasmp may be smaller
than the shadow tombstone's.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
typo: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
Closesscylladb/scylladb#19472
* github.com:scylladb/scylladb:
commitlog/database: Make some commitlog options updatable + add feature listener
features/config: Add feature for fragmented commitlog entries
docs: Add entry on commitlog file format v4
commitlog_test: Add more oversized cases
commitlog_replayer: Replay segments in order created
commitlog_replayer: Use replay state to support fragmented entries
commitlog_replayer: coroutinize partly
commitlog: Handle oversized entries
If the row_marker is missing then its timestamp
is missing as well, so there's no point
calling update_timestamp for it. Better return early.
This should cause no functional change.
The following patch will add more logic
for tracking extended timestamp stats.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently storage service is drained while group0 is still active. The
draining stops commitlogs, so after this point no more writes are
possible, but if group0 is still active it may try to apply commands
which will try to do writes and they will fail causing group0 state
machine errors. This is benign since we are shutting down anyway, but
better to fix shutdown order to keep logs clean.
Fixesscylladb/scylladb#19665
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
If you are using dbuild, that's where test.py needs to run.
Also, replace 'Docker image' with the more generic 'container' term.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#20336
ScyllaDB doesn't support custom compressors. The available compressors
are the only available ones, not the default ones.
Adjust the text to reflect this.
Closesscylladb/scylladb#20225
The test-env in question is mostly started in one-shard mode. Also there
are several boost tests that start sharded<> environment. In that case
instances on different shards live in different temp dirs. That's not
critical yet, but better to have single directory for the whole test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20412
In test.py every asyncio task spawned during the test must be finished before the next test, otherwise, tests might affect each other results.
The developers are responsible for writing asyncio code in a way that doesn’t leave task objects unfinished.
Test.py has a mechanism that helps test writers avoid such tasks. At the end of each test case, it verifies that the test did not produce/leave any tasks and sets an event object that fails the next test at the start if this is the case(issue https://github.com/scylladb/scylladb/issues/16472)
The problem with this was that breaking the next test was counterintuitive, and the logging for this situation was insufficient and unobvious.
notes: Task.cancel() is not an option to avoid task leakage
1) Calling cancel() Does Not Cancel The Task : the cancel() method just request that the target task cancel.
2) Calling cancel() Does Not Block Until The Task is Cancelled: If the caller needs to know the task is cancelled and done, it could await for the target
3) In particular PR, task.cancel() cancell task on client(ManagerClient) but not on http server(ScyllaManager). so "await" is needed.
Closesscylladb/scylladb#20012
The test in question generates a bunch of table_for_tests objects and
creates sstables for each. For that it calls test_env::make_sstable(),
but it can be made shorter, by calling table method directly.
The hidden goal of this change is to remove the explicit caller of
table::dir() method. The latter is going away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20451
This includes
- coroutinization
- elimination of unused overload
Closesscylladb/scylladb#20456
* github.com:scylladb/scylladb:
test: Squash two open_sstables() helper together
test: Coroutinize open_sstables() helper
This series prepares us for working on #11567 - allow adding a GSI to a pre-existing table. This will require changing the implementation of GSIs in Alternator to not use real columns in the schema for the materialized view, and instead of a computed column - a function which extracts the desired member from the `:attrs` map and de-serializes it.
This series does not contain the GSI re-implementation itself. Rather it contains a few small cleanups and mostly - new regression tests that cover this area, of adding and removing a GSI, and **using** a GSI, in more details than the tests we already had. I developed most of these tests while working on **buggy** fixes for #11567; The bugs in those implementations were exposed by the tests added here - they exposed bugs both in the new feature of adding or removing a GSI, and also regressions to the ordinary operation of GSI. So these tests should be helpful for whoever ends up fixing #11567, be it me based on my buggy implementation (which is _not_ included in this patch series), or someone else.
No backports needed - this is part of a new feature, which we don't usually backport.
Closesscylladb/scylladb#20383
* github.com:scylladb/scylladb:
test/alternator: more extensive tests for GSI with two new key attributes
test/alternator: test invalid key types for GSI
test/alternator: test combination of LSI and GSI
test/alternator: expand another test to use different write operations
test/alternator: test GSIs with different key types
alternator: better error message in some cases of key type mismatch
test/alternator: test for more elaborate GSI updates
test/alternator: strengthen tests for empty attribute values
test/alternator: fix typo in test_batch.py
test/alternator: more checks for GSI-key attribute validation
Alternator: drop unneeded "IS NOT NULL" clauses in MV of GSI/LSI
test/alternator: add more checks for adding/deleting a GSI
test/alternator: ensure table deletions in test_gsi.py
There are some test cases in sstable_directory_test test actually create
a table with CQL and then try to manipulate its sstables with the help
of sstable_directory. Those tests use existing local helper that starts
sharded<sstable_directory> and this helper passes test-local static
schema to sstable_directory constructor. As a result -- the schema of a
table that test case created and the schema that sstable_directory works
with are different. They match in the columns layout, which helps the
test cases pass, but otherwise are two different schema objects with
different IDs. It's more correct to use table schema for those runs.
The fix introduces another helper to start sharded<sstable_directory>,
and the older wrapper around cql_test_env becomes unused. Drop it too
not to encourage future tests use it and re-introduce schema mismatch
again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20499
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.
Note that this requires loading and processing
the base directory first.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, there are leftover log messages using
sstlog rather than dirlog, that was introduced
in aebd965f0e,
and that makes debugging harder.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When split, scrub, and upgrade compactions ran under the compaction
group, they had to bump up their shares to a minimum of 200 to prevent
slow progress as they neared completion, especially in workloads with
inconsistent ingestion rates. Since commit e86965c2 moved these
compactions to the maintenance group, this share bump is no longer
necessary. This patch removes the unnecessary share allocation.
Fixes#20224
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20495
Any expired tombstone can be garbage collected if it doesn't shadow data in the commit log, memtable, or uncompacting SSTables.
This PR introduces a new mode to major compaction, enabled by the `consider_only_existing_data` flag that bypasses these checks. When enabled, memtables and old commitlog segments are cleared with a system-wide flush and all the sstables (after flush) are included in the compaction, so that it works with all data generated up to a given time point.
This new mode works with the assumption that newly written data will not be shadowed by expired tombstones. So it ignores new sstables (and new data written to memtable) created after compaction started. Since there was a system wide flush, commitlog checks can also be skipped when garbage collecting tombstones. Introducing data shadowed by a tombstone during compaction can lead to undefined behavior, even without this PR, as the tombstone may or may not have already been garbage collected.
Fixes#19728Closesscylladb/scylladb#20031
* github.com:scylladb/scylladb:
cql-pytest: add test to verify consider_only_existing_data compaction option
tools/scylla-nodetool: add consider-only-existing-data option to compact command
api: compaction: add `consider_only_existing_data` option
compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp
compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled
tombstone_gc_state: introduce with_commitlog_check_disabled()
compaction: introduce new option to check only compacting sstables for gc
compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
compaction: maybe_flush_all_tables: add new force_flush param
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.
Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
Should be backported to 6.0 and 6.1; the dtest started failing due to topology on raft, which sped up execution of the test and exposed the preexisting problem.
Closesscylladb/scylladb#20488
* github.com:scylladb/scylladb:
test: topology_custom/test_hints: consistency test for decommission
test: topology_custom/test_hints: move sync point helpers to top level
test: topology/util: extract find_server_by_host_id
hints: send hints with CL=ALL if target is leaving
hints: inline do_send_one_mutation
~~~
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:
- The "65K" claim as default value on non-production systems is wrong;
"fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.
- The section in question uses equal signs (=) incorrectly. The intent was
probably to say "which means the same as", but that's not what equality
means.
- In the same section, the relational operator "<" is bogus. The available
AIO count must be at least as high (>=) as the requested AIO count.
- Clearer names should be used;
adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
sets a great example:
- "reactor::max_aio" should be called "storage_iocbs",
- "detect_aio_poll" should be called "preempt_iocbs",
- "reactor_backend_aio::max_polls" should be called "network_iocbs".
- The specific value 10000 for the last one ("network_iocbs") is not
correct in scylla's context. It is correct as the Seastar default, but
scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
number of networking iocbs", 2021-07-18).
Rewrite the section to address these problems.
See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~
No need for backporting; the documentation being refreshed targets developers as audience, not end-users.
Closesscylladb/scylladb#20398
* github.com:scylladb/scylladb:
docs/dev/docker-hub.md: refresh aio-max-nr calculation
docs/dev/docker-hub.md: strip trailing whitespace
The yielding lister is considered to be better replacement that scan_dir(lambda) one.
Also, the sstable directory will be patched to scan the contents of S3 bucket and yielding lister fits better for generalization.
Closesscylladb/scylladb#20114
* github.com:scylladb/scylladb:
sstable_directory: Fix indentation after previous patches
sstable_directory: Use yielding lister in .handle_sstables_pending_delete()
sstable_directory: Use yielding lister in .cleanup_column_family_temp_sst_dirs()
sstable_directory: Use yielding lister in .prepare()
sstable_directory: Shorten lister loop
sstable_directory: Use with_closeable() in .process()
directory_lister: Add noexcept default move-constructor
In sstable directory test there are two of those -- one that works on
path, state, env and callback, and the other one that just needs env and
callback, getting path from env and assuming state is normal.
Two test cases in this test can enjoy the shorter one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20395
On very large node, LimitNOFILES=80000 may not enough size, it can cause
"Too many files" error.
To avoid that, let's increase LimitNOFILES on scylla_setup stage,
generate optimal value calurated from memory size and number of cpus.
Closesscylladb/scylla-enterprise#4304Closesscylladb/scylladb#20443
c70f321c6f added an extra check if KS
exists. This check can throw `data_dictionary::no_such_keyspace`
exception, which is supposed to be caught and a more user-friendly
exception should be thrown instead.
This commit fixes the above problem and adds a testcase to validate it
doesn't appear ever again.
Also, I moved the check for the keyspace outside of the `for` loop, as
it doesn't need to be checked repeatedly.
Fixes: scylladb/scylladb#20097Closesscylladb/scylladb#20404
The case of a GSI with two key attributes (hash and range) which were both
not keys in the base table is a special case, not supported by CQL but
allowed in Alternator. We have several tests for this case, but they don't
cover all the strange possibilities that a GSI row disappears / reappears
when one or two of the attributes is updated / inserted / deleted.
So this patch includes a more extensive test for this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that types which are not allowed for GSI keys -
basically any type except S(tring), B(ytes) or N(number), are rejected
as expected - an error path that we didn't cover in existing tests.
The new test passes - Alternator doesn't have a bug in this area, and
as usual, also passes on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To allow adding a GSI to an existing table (refs #11567), we plan to
re-implement GSIs to stop forcing their key attribute to become a real
column in the schema - and let it remains a member of the map ":attrs"
like all non-key attributes. But since LSIs can only be defined on table
creation time, we don't have to change the LSI implementation, and these
can still force their key to become a real column.
What the test in this patch does is to verify that using the same
attribute as a key of *both* GSI and LSI on the same table works.
There's a high risk that it won't work: After all, the LSI should force the
attribute to become a real column (to which base reads and writes go), but
the GSI will use a computed column which reads from ":attrs", no? Well,
it turns out that view.cc's value_getter::operator() always had a
surprising exception which "rescues" this test and makes it pass: Before
using a computed column, this code checks if a base-table column with the
same name exists, and if it does, it is used instead of the computed column!
It's not clear why this logic was chosen, but it turns out to be really
useful for making the test in this test pass. And it's important that if
we ever change that unintuitive behavior, we will have this test as a
regression test.
The new test unsurprisingly passes on current Scylla because its
implementation of GSI and LSI is still the same. But it's an important
regression test for when we change the GSI implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Expand another Alternator test (test_gsi.py::test_gsi_missing_attribute)
to write items not just using PutItem, but also using UpdateItem and
BatchWriteItem. There is a risk that these different operations use
slightly different code paths - so better check all of them and not
just PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All of the tests in test/alternator/test_gsi.py use strings as the GSI's
keys. This tests a lot of GSI functionality, but we implicitly assumed that
our implementation used an already-correct and already-tested implementation
of key columns and MV, which if it works for one type, works for other types
as well.
This assumption will no longer hold if we reimplement GSI on a "computed
column" implementation, which might run different code for different types
of GSI key attributes (the supported types are "S"tring, "B"ytes, and
"N"umber).
So in this patch we add tests for writing and reading different types of
GSI key attributes. These tests showed their importance as regression
tests when the first draft of the GSI reimplementation series failed them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Alternator uses a common function get_typed_value() to read the values
of key attribute and confirm they have the expected type (key attributes
have a fixed type in the schema). If the type is wrong, we want to print
a "Type mismatch" error message.
But the current implementation did the checks in the wrong order, and
as a result could print a "Malformed value object" message instead of a
"Type mismatch". That could happen if the wrong type is a boolean, map,
list, or basically any type whose JSON representation is not a string.
The allowed key types - bytes), string and number - all have string
representations in JSON, but still we should first report the mismatched
type and only report the "Malformed object" if the type matches but the
JSON is faulty.
In addition to fixing the error message, we fix an existing test which
complained in a comment (but ignored) that the error message in some
case (when trying to use a map where a key is expected) the strange
"Malformed value object" instead of the expected "Type mismatch".
The next patch will add an additional reproducer for this problem and
its fix. That test will do:
```
with pytest.raises(ClientError, match='ValidationException.*mismatch'):
test_table_gsi_6.put_item(Item={'p': p, 's': True})
```
I.e., it tries to set a boolean value for a string key column, and
expect to get the "Type mismatch" error and not the ugly "Malformed
value object".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most tests in test_gsi.py involve simple updates to a GSI, just
creating a GSI row. Although a couple of tests did involve more
complex operations (such as an update requiring deleting an old row
from the GSI and inserting a new one,), we did not have a single
organized test designed to check all these cases, so we add one in
this patch.
This test (test_update_gsi_pk) will be important for verifying
the low-level implementation of the new GSI implementation that
we plan to based on computed columns. Early versions of that code
passed many of the simpler tests, but not this one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We soon plan to refactor Alternator's GSI and change the validation of
values set in attributes which are GSI keys. It's important to test that
when updating attributes that are *not* GSI keys - and are either base-
table keys or normal non-key attributes - the validation didn't change.
For example, empty strings are still not allowed in base-table key
attributes, but are allowed (since May 2020 in DynamoDB) in non-key
attributes.
We did have tests in this area, but this patch strengthens them -
adding a test for non-key attribute, and expanding the key-attribute
test to cover the UpdateItem and BatchWriteItem operations, not just
PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To enhance the test reports UX:
1. switching off/on passed/failed/skipped test for better visibility
2. better searching in test results
3. understanding the trends of execution for each test
4. better configurability of the final report
Enable allure adapter for all python tests.
Add tags and parameters to the test to be able to distinguish them across modes and runs.
Related: https://github.com/scylladb/qa-tasks/issues/1665
Related: https://github.com/scylladb/scylladb/pull/19335
Related: https://github.com/scylladb/scylladb/pull/18169Closesscylladb/scylladb#19942
* github.com:scylladb/scylladb:
[test.py] Clean duplicated arg for test suite
[test.py] Enable allure for python test
Two tests had a typo 'item' instead of 'Item'. If Scylla had a bug, this
could have caused these tests to miss the bug.
Scylla passes also the fixed test, because Scylla's behavior is correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When an attribute is a GSI key, DynamoDB imposes certain rules when
writing values for it - it must be of the declared type for that key,
and can't be an empty string. We had tests for this, but all of them
did the write using the PutItem operation.
In this patch we also test the same things using the UpdateItem and
BatchWriteItem operations. Because Scylla has different code paths
for these three operations, and each code path needs to remember to
call the validation function, all three should all be checked and not just
PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Scylla's materialized views naturally skip any base rows where the view's
key isn't set (is NULL), because we can't create a view row with a null
key. To make the user aware that this is happening, the user is required
to add "WHERE ... IS NOT NULL" for the view's key columns when defining
the view. However, the only place that these extra IS NOT NULL clauses
are checked are in the CQL "CREATE MATERIALIZED VIEWS" statement - they
are completely ignored in all other places in the code.
In particular, when we create a materialized view in Alternator (GSI or
LSI), we don't have to add these "IS NOT NULL" clauses, as they are
outright ignored. We didn't know they were ignored, and made an effort
to add them - but no matter how incorrectly we did it, it didn't matter :-)
In commit 2bf2ffd3ed it turned out we had a
typo that caused the wrong column name to be printed. Also, even today we
are still missing base key columns that aren't listed as a view key in
Alternator but still added as view clustering keys in Scylla - and again
the fact these were missing also didn't matter. So I think it's time to
stop pretending, and stop calculating these "IS NOT NULL" strings, so
this patch outright removes them from the Alternator view-creation code.
Beyond being a nice cleanup of unnecessary and inaccurate code, it
will also be necessary when we allow in later patches to index for
an Alternator attribute "x" not a real column x in the base table but
rather an element in the ":attrs" map - so adding a "x IS NOT NULL" isn't
only unnecessary, it is outright illegal: The expression evaluation code,
even though it doesn't do anything with the "IS NOT NULL" expression,
still verifies that "x" is a valid column, which it isn't.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have tests for the feature of adding or removing a GSI from
an existing table, which Alternator doesn't yet support (issue #11567).
In this patch we add another check, how after a GSI is added, you can
no longer add items with the wrong type for the indexed type, and after
removing a GSI, you can. The expanded tests pass on DynamoDB, and
obviously still xfail on Alternator because the feature is not yet
implemented.
Refs #11567.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.
We missed a few such cases in test_gsi.py, and fixed some of them in
commit 59c1498338 but still missed a few,
and this patch fixes some more instances of this problem.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.
There are no functional changes in this patch - most of the lines
changed are just reindents.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
in main.cc, we start redis with `ss.local().register_protocol_server()`
only if it is enabled. but `storage_service` always calls
`stop_server()` with _all_ registered server, no matter if they have
started or not. in general, it does not hurt. for instance,
`redis::controller::stop_server()` is a noop, if the controller
is not started. but `storage_service` still print the logging message
like:
```
INFO 2024-09-04 11:20:02,224 [shard 0:main] storage_service - Shutting down redis server
INFO 2024-09-04 11:20:02,224 [shard 0:main] storage_service - Shutting down redis server was successful
```
this could be confusing or at least distracting when a field engineer
looks at the log. also, please note, `redis_port` and `redis_ssl_port`
cannot be changed dynamically once scylla server is up, so we do not
need to worry about "what if the redis server is started at runtime,
how can is be stopped?".
the same applies to alternator service.
in this change, to avoid surprises, we conditionally register the
protocol servers with the storage service based on their enabled statuses.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20472
after switching over to the new `seastar::format()` which enables
the compile-time format check, the fmt string should be a constexpr,
otherwise `fmt::format()` is not able to perform the check at compile
time.
to prepare for bumping up the seastar module to a version which
contains the change of `seastar::format()`, let's mark the format
string with `constexpr const`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20484
Adds the test_hints_consistency_during_decommission test which
reproduces the failure observed in scylladb/scylla-dtest#4582. It uses
error injections, including the newly added
topology_coordinator_pause_after_streaming injection, to reliably
orchestrate the scenario observed there.
In a nutshell, the test makes sure to replay hints after streaming
during decommission has finished, but before the cluster switches to
reading from new replicas. Without the fix, hints would be replayed to
the decommissioned node and then would be lost forever after the cluster
start reading from new replicas.
Move create_sync_point and await_sync_point from the scope of the
test_sync_point test to the file scope. They will be used in a test that
will be introduced in the commit that follows.
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only
send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.
Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.
Shorter and simpler this way. Hopefully it doesn't sit on critical paths
Closesscylladb/scylladb#20460
* github.com:scylladb/scylladb:
sstables: Fix indentation after previous patch
sstables: Coroutinize sstable::read_summary()
The idea of the test is to have a cluster where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the raft state machine.
To achieve this following two lists introduced in the PR:
- ERROR_INJECTIONS in error_injections.py
- CLUSTER_EVENTS in cluster_events.py
Each cluster event is an async generator which has 2 yields and should be used in the following way:
0. Start the generator:
```python
>>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
```
1. Run the prepare part (before the first yield)
```python
>>> await anext(cluster_event_steps)
```
2. Run the cluster event itself (between the yields)
```python
>>> await anext(cluster_event_steps)
```
3. Run the check part (after the second yield)
```python
>>> await anext(cluster_event, None)
```
Closesscylladb/scylladb#16223
* github.com:scylladb/scylladb:
test: randomized failure injection for Raft-based topology
test: error injections for Raft-based topology
[test.py] topology.util: add get_non_coordinator_host() function
[test.py] random_tables: add UDT methods
[test.py] random_tables: add CDC methods
[test.py] api: get scylla process status
[test.py] api: add expected_server_up_state argument to server_add()
The `service_level::marked_for_deletion` field is always set to `false`. It might have served some purpose in the past, but now it can be just removed, simplifying the code and eliminating confusion about the field.
This is just code cleanup, no backport is needed.
Closesscylladb/scylladb#20452
* github.com:scylladb/scylladb:
service/qos: remove the marked_for_deletion parameter
service/qos: add constructors to service_level
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).
Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.
Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.
Fixes: scylladb/scylladb#20111Closesscylladb/scylladb#20320
During review of 0857b63259 it was noticed that the function
repair_get_row_diff_with_rpc_stream_process_op()
and its _slow_path callee only ever return stop_iteration::no (or throw
an exception). As such, its return value is useless, and in fact the
only caller ignores it. Simplify by returning a plain future<>.
Closesscylladb/scylladb#20441
so that it is accessible from its caller. if we enforce the
compile-time format string check, the formatter would need the access to
the specialization of `fmt::formatter` of the arguments being foramtted.
to be prepared for this change, let's move the `fmt::formatter`
specialization up, otherwise we'd have following error after switching
to the compile-time format string check introduced by a recent seastar
change:
```
In file included from ./auth/authenticator.hh:22: ./auth/authentication_options.hh:50:49: error: call to consteval function 'fmt::basic_format_string<char, auth::authentication_option &>::basic_format_string<
char[32], 0>' is not a constant expression
50 | : std::invalid_argument(fmt::format("The {} option is not supported.", k)) {
| ^ ./auth/authentication_options.hh:57:13: error: explicit specialization of 'fmt::formatter<auth::authentication_option>' after instantiation
57 | struct fmt::formatter<auth::authentication_option> : fmt::formatter<string_view> {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/base.h:1228:17: note: implicit instantiation first required here
1228 | -> decltype(typename Context::template formatter_type<T>().format(
| ^
In file included from replica/distributed_loader.cc:30:
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20447
And move the comment inside if while at it, it looks better in there
(and makes less churn in the patch itself)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The idea of the test is to have a small cluster, where one node
is stressed with injections and failures and the rest of
the cluster is used to make progress of the Raft state machine.
To achieve this following two lists introduced in the commit:
- ERROR_INJECTIONS in error_injections.py
- CLUSTER_EVENTS in cluster_events.py
Each cluster event is an async generator which has 2 yields and should be
used in the following way:
0. Start the generator:
>>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
1. Run the prepare part (before the first yield)
>>> await anext(cluster_event_steps)
2. Run the cluster event itself (between the yields)
>>> await anext(cluster_event_steps)
3. Run the check part (after the second yield)
>>> await anext(cluster_event, None)
Add get_non_coordinator_host() function which returns
ServerInfo for the first host which is not a coordinator
or None if there is no such host.
Also rework get_coordinator_host() to not fail if some
of the hosts don't have a host id.
Allow to return from server_add() when a server reaches specified state.
One of:
- PROCESS_STARTED
- HOST_ID_QUERIED (previously called NOT_CONNECTED)
- CQL_CONNECTED (renamed from CONNECTED)
- CQL_QUERIED (was just QUERIED)
Also, rename CqlUpState to ServerUpState and move to internal_types.
After it was switched to use schema builder, the indenation of untouched
lines deserves one extra space.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Everything, but perf test is straightforward switch.
The perf-test generated regular columns dynamically via vector, with
builder the vector goes away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When creating sstables this test allocates temporary local options.
That works, because this test doesn't run on object storage, but it's
more correct to pick storage options from the table at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20440
Add a test replacing a node and verifying the contents of the
view_build_status table are updated as expected, having rows for the new
node and no rows for the old node.
Change the util function wait_for_view to read the view build status
from the system.view_build_status_v2 table which replaces
system_distributed.view_build_status.
The old table can still be used but it is less efficient because it's
implemented as a virtual table which reads from the v2 table, so it's
better to read directly from the v2 table. This can cause slowness in
tests.
The additional util function wait_for_view_v1 reads from the old table.
This may be needed in upgrade tests if the v2 table is not available
yet.
When writing to the view_build_status we have common logic related to
upgrade and deciding whether to write to sys_dist ks or group0.
Move this common logic to a generic function used by all functions
writing to the table.
Add an intermediate phase to the view builder migration to v2 where we
write to both the old and new table in order to not lose writes during
the migration.
We add an additional view builder version v1_5 between v1 and v2 where
we write to both tables. We perform a barrier before moving to v2 to
ensure all the operations to the old table are completed.
When a node is removed we want to clean its rows from the
view_build_status table.
Now when removing a node and generating the topology state update, we
generate also the mutations to delete all the possible rows belonging to
the node from the table.
When migrating the view_build_status to v2, skip adding any leftover
rows that don't correspond to an existing node or an existing view.
Previously such rows could have been created and not cleaned, for
example when a node is removed.
After migrating the view build status from
system_distributed.view_build_status to system.view_build_status_v2, we
set system_distributed.view_build_status to be a virtual table, such
that reading from it is actually reading from the underlying new table.
The reason for this is that we want to keep compatibility with the old
table, since it exists also in Cassandra and it is used by various external
tools to check the view build status. Making the table virtual makes the
transition transparent for external users.
The two tables are in different keyspaces and have different shard
mapping. The v1 table is a distributed table with a normal shard
mapping, and the v2 table is a local table using the null sharder. The
virtual reader works by constructing a multishard reader which reads the rows
from shard zero, and then filtering it to get only the rows owned by the
current shard.
Add tests to verify the new view_build_status_v2 is used by the
view_builder and can be read from all nodes with the expected values.
Also test a migration from the v1 layout to v2.
Migrate view_builder to v2, to store the view build status of all nodes
in the group0 based table view_build_status_v2.
Introduce a feature view_build_status_on_group0 so we know when all
nodes are ready to migrate and use the new table.
A new cluster is initialized to use v2. Otherwise, The topology coordinator
initiates the migration when the feature is enabled, if it was not done
already.
The migration reads all the rows in the v1 table and writes it via
group0 to the v2 table, together with a mutation that updates the
view_builder parameter in scylla_local to v2. When this mutation is
applied, it updates the view_builder service to start using the v2
table.
Add a new scylla_local parameter view_builder_version, and functions to
read and mutate the value.
The version value defaults to v1 if it doesn't exist in the table.
Update the view_status function to read from the new
view_build_status_v2 table when enabled.
The code to read and extract the values is identical to v1 and v2 except it
accesses different keyspace and table, so the common code is extracted
to the view_status_common function and used by both v1 and v2 flows with
appropriate parameters.
Introduce the announce_with_raft function as alternative to writing view build
status mutations to the table in system_distributed. Instead, we can
apply the mutations via group0 operation to the view_build_status_v2
table.
All the view_builder functions that write to the view_build_status table
can be configured by a flag to either write the legacy way or via raft.
Store references of group0_client and query_processor in the
view_builder service.
They are required for generating mutations and writing them via group0.
Because of https://github.com/scylladb/scylladb/issues/9285 heat weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.
Fixesscylladb/scylladb#20430Closesscylladb/scylladb#20414
When testing mv admission control, we perform a large view update
and check if the following view update can be admitted due to the
high view backlog usage. We rely on a delay which keeps the backlog
high for longer to make sure the backlog is still increased during
the second write. However, in some test runs the delay is not long
enough, causing the second write to miss the large backlog and not
hit admission control.
In this patch we keep the increased backlog high using another
injection instead of relying on a delay to make absolute sure
that the backlog is still high during the second write.
Fixesscylladb/scylladb#20382Closesscylladb/scylladb#20445
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:
- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
checking the SSTables being compacted. Memtables, commit logs, and
other SSTables not part of the compaction will not be checked, as they
will only contain newer data that arrived after the compaction
started.
The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.
The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.
Fixes#19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When gc_check_only_compacting_sstables is enabled,
get_max_purgeable_timestamp should not check memtables and other
sstables that are not part of the compaction to deduce the max purgeable
timestamp.
Refs #19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When the compaction_descriptor's gc_check_only_compacting_sstables flag
is enabled, create and pass a copy of the get_tombstone_gc_state that
will skip checking the commitlog.
Refs #19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new method, `with_commitlog_check_disabled`, that returns a new
copy of the tombstone_gc_state but with commitlog check disabled. This
will be used by a following patch to disable commitlog checks during
compaction.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added new option, `gc_check_only_compacting_sstables`, to
compaction_descriptor to control the garbage collection behavior. The
subsequent patches will use this flag to decide if the garbage
collection has to check only the SSTables being compacted to collect
tombstones. This option is disabled for now and will be enabled based on
a new compaction parameter that will be added later in this patch
series.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog and flushing all the table is just a dependency.
Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Added a new
wrapper method to database::flush_all_tables(),
database::flush_commitlog(), that is now called from
maybe_flush_commitlog.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a new parameter, `force_flush` to the maybe_flush_all_tables()
method. Setting `force_flush` to true will flush all the tables
regardless of when they were flushed last. This will be used by the new
compaction option in a following patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Before the introduction of "scripts/refresh-submodules.sh", there was
indeed some manual work for the maintainer to do, hence "publish your
work" must have sounded correct. Today, the phrase "publish your work"
sounds confusing.
Commit 71da4e6e79 ("docs: Document sync-submodules.sh script in
maintainer.md", 2020-06-18) should have arguably reworded the last step of
the submodule refresh procedure; let's do it now.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20333
The with_sstable_dir() helper no longer needs one, it used to pass it as
argument to sstable_directory constructor, but now the directory doesn't
need it (takes semaphore via table object).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20396
Squash call to lister.get() and check for the returned value into
while()'s condition. This saves few more lines of code as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method already uses yielding lister, but handles the exceptions
explicitly. Use with_closeable() helper, it makes the code shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's required to make it possible to push lister into with_closeable().
Its requiremenent of nothrow-move-constructible doesn't accept
default-generated one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The `skip()` method of the compressed data source implementation uses an
assert statement to check if the given offset is valid.
Replace this with `on_internal_error()` to fail gracefully. An invalid
offset shouldn't bring the whole server down.
Also, enhance the error message for unsynced compressed readers.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
It used to rely on bool (wrapped with pointer) and future<>-based loop
helper, now it can just break from the while loop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And update its callers again.
Preserve no longer relevant local smart pointers until next patch.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One accepts integer generations, another one accepts "generic" ones. The
latter is only called by the former, so no sense in keeping it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a default constructor and a constructor which explicitly
initializes all fields of the service_level structure.
This is done in order to make sure that removal of the
marked_for_deletion field can be done safely - otherwise, for example,
service_level could be aggregate-initialized with an incomplete list of
values for the fields, and removing marked_for_deletion which is in the
middle of the struct would cause the is_static field to be initialized
with the value that was designated for marked_for_deletion.
As a bonus, make sure that marked_for_deletion and is_static bool fields
are initialized in the default constructor to false in order to avoid
potential undefined behavior.
There are two versions of `raft_group0_client::hold_read_apply_mutex`, one takes `abort_source&`, the other doesn't. Modify all call sites that used the non-abort-source version to pass an `abort_source&`, allowing us to remove the other overload.
If there is no explicit reason not to pass an `abort_source&`, then one should be passed by default -- it often prevents hangs during shutdown.
---
No backport needed -- no known issues affected by this change.
Closesscylladb/scylladb#19996
* github.com:scylladb/scylladb:
raft_group0_client: remove `hold_read_apply_mutex` overload without `abort_source&`
storage_service: pass `_abort_source` to `hold_read_apply_mutex`
group0_state_machine: pass `_abort_source` to `hold_read_apply_mutex`
api: move `reload_raft_topology_state` implementation inside `storage_service`
for following reasons:
1. the ppa in question does not provide the build for the latest ubuntu's LTS release. it only builds for trusty, xenial, bionic and jammy. according to https://wiki.ubuntu.com/Releases, the latest LTS release is ubuntu noble at the time of writing.
2. the ppa in question does not provide the packages used in production. it does provides the package for *building* scylla
3. after we introduced the relocatable package, there is no need to provide extra user space dependencies apart from scylla packages.
so, in this change, we remove all references to enabling the Scylla/PPA repository.
Fixesscylladb/scylladb#20449
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20450
for better readability.
presumably, `sstable::seal_sstable()` is not on the critical path,
and we don't need to worry about the overhead of using C++20 coroutine.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20410
instead of chaining the conditions with '&&', break them down.
for two reasons:
* for better readability: to group the conditions with the same
purpose together
* so we don't look up the table twice. it's an anti-pattern of
using STL, and it could be confusing at first glance.
this change is a cleanup, so it does not change the behavior.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20369
The Alternator TTL scanning code uses an object "scan_ranges_context"
to hold the scanning context. One of the members of this object is
a service::query_state, and that in turn holds a reference to a
service::client_state. The existing constructor created a temporary
client_state object and saved a reference to it - which can result
in use after free as the temporary object is freed as soon as the
constructor ends.
The fix is to save a client_state in the scan_ranges_context object,
instead of a temporary object.
Fixes#19988
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20418
Makes some commitlog options runtime updatable. Most important for this case,
the usage of fragmented entries. Also adds a subscription in database on said
feature, to possibly enable once cluster enables it.
Hides the functionality behind a cluster feature, i.e. postspones
using it until an upgrade is complete etc. This to allow rolling back
even with dirty nodes, at least until a cluster is commited.
Feature can also be disabled by scylla option, just in case. This will
lock it out of whole cluster, but this is probably good, because depending
on off or on, certain schema/raft ops might fail or succeed (due to large
mutations), and this should probably be equivalent across nodes.
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
type: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force syncronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
v2:
* Improve some bookeep, ensure we keep track of segments and flush
properly, to get counter correct
This commit temporarily disables redirections for all pages under Features
that were moved with this PR: https://github.com/scylladb/scylladb/pull/20401
Redirections work for all versions. This means that pages in 6.1 are redirected
to URLs that are not available yet (because 6.2 has not been released yet).
The redirections are correct and should be enabled when 6.2 is released:
I've created an issue to do it: https://github.com/scylladb/scylladb/issues/20428Closesscylladb/scylladb#20429
Tests that try to access sstables from test/resource/ typically sstable::load() it after object creation. There's reusable_sst() helper for that. This PR fixes one more caller that still goes longer route by doing sstable and loading it on its own.
Closesscylladb/scylladb#20420
* github.com:scylladb/scylladb:
test: Call reusable sst from ka_sst() helper
test: Move sstable_open_config to reusable_sst()'s argument
There's no point waiting for this lock if `storage_service` is being
aborted. In theory the lock, if held, should be eventually released by
whatever is holding it during shutdown -- but if there is some cyclic
reference between the services, and e.g. whatever holds the lock is
stuck because of ongoing shutdown and would only be unstuck by
`storage_service` getting stopped (which it can't because it's waiting
on the lock), that would cause a shutdown deadlock. Better to be safe
than sorry.
In later commit we'll want to access more `storage_service` internals
in the API's implementation (namely, `_abort_source`)
Also moving the implementation there allows making
`service::topology_transition()` private again (it was made public in
992f1327d3 only for this API
implementation)
Run the reversed queries on a 2-node cluster with CL=ALL with and
without NATIVE_REVERSE_QUERIES feature flag. When the flag is enabled,
the native reversed format is used, otherwise the legacy format.
The NATIVE_REVERSE_QUERIES feature flag is suppressed with an error
injection that simulates cluster upgrade process.
Backport is not required. The patch adds additional upgrade tests
for https://github.com/scylladb/scylladb/pull/18864Closesscylladb/scylladb#20179
This patch address two requests made by reviewers of the original "Add
CQL-based RBAC support to Alternator" series. Both requests were about
the error messages produced when access is denied:
1. The error message is improved to use more proper English, and also
to include the name of the role which was denied access.
2. The permission-check and error-message-formatting code is
de-duplicated, using a common function verify_permission().
This de-duplication required moving the access-denied error path to
throwing an exception instead of the previous exception-free
implementation. However, it can be argued that this change is actually
a good thing, because it makes the successful case, when access is
allowed, faster.
The de-duplicated code is shorter and simpler, and allowed changing
the text of the error message in just one place.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20326
in 372a4d1b79, we introduced a change
which was for debugging the logging message. but the logging message
intended for printing the temp_dir not prints an `optional<int>`. this
is both confusing, and more importantly, it hurts the debuggability.
in this change, the related change is reverted.
Fixesscylladb/scylladb#20408
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20409
The sstable_mutation_test wants to load pre-existing sstables from
resouce/ subdir. For that there's reusable_sst() helper on env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:
- The "65K" claim as default value on non-production systems is wrong;
"fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.
- The section in question uses equal signs (=) incorrectly. The intent was
probably to say "which means the same as", but that's not what equality
means.
- In the same section, the relational operator "<" is bogus. The available
AIO count must be at least as high (>=) as the requested AIO count.
- Clearer names should be used;
adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
sets a great example:
- "reactor::max_aio" should be called "storage_iocbs",
- "detect_aio_poll" should be called "preempt_iocbs",
- "reactor_backend_aio::max_polls" should be called "network_iocbs".
- The specific value 10000 for the last one ("network_iocbs") is not
correct in scylla's context. It is correct as the Seastar default, but
scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
number of networking iocbs", 2021-07-18).
Rewrite the section to address these problems.
See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
This commit one of the series to remove the FAQ page by removing irrelevant/outdated entries
or moving them to the forum.
The question about seeds is irrelevant, not frequently asked, and covered in other sections
of the docs. Also, it mentions versions that are no longer supported.
Closesscylladb/scylladb#20403
Even after 13caac7, we still have more files incorrect permission, since
we use "cp -r" and creating new file with redirect.
To fix this, we need to replace "cp -r" with "cp -pr", and "chmod <perm>" on
newly created files.
Fixes#14383
Related #19775Closesscylladb/scylladb#19786
This commit moves the Features page from the section for developers
to the top level in the page tree. This involves:
- Moving the source files to the *features* folder from the *using-scylla* folder.
- Moving images into *features/images* folder.
- Updating references to the moved resources.
- Adding redirections to the moved pages.
Closesscylladb/scylladb#20401
this change contains two improvements to "backup" and "restore" commands:
- let them print task id
- let them return 1 as the exist status code upon operation failure
----
these changes are improvements to the newly introduced commands, which are not in any LTS branches yet, so no need to backport.
Closesscylladb/scylladb#20371
* github.com:scylladb/scylladb:
tools/scylla-nodetool: return failure with exit code in backup/restore
tools/scylla-nodetool: let backup/restore print task id
before this change, "backup" and "restore" commands always return 0 as
their exist code no matter if the performed operation fails or not.
inspired by the "task" commands of nodetool, let's return 1 with
exit code if the operation fails.
the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in 20fffcdc, we added the "task wait" subcommand, so user is allowed to
interact with a task with its task id. and in existing implementation of
"backup" and "restore" command, if user does not pass `--nowait`, the
command just exits without any output upon sending the request to
scylladb.
in this change, we print out the task_id if user does not pass
`--nowait` command line option to "backup" or "restore" command. this
allows user to follow up on the operation if necessary.
the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds functional testing for the role-based access control
(RBAC) "auto-grant" feature, where if a user that is allowed to create
a table, it also recieves full permissions over the table it just
created. We also test permissions over new materialized views created
by a user, and over CDC logs. The test for CDC logs reproduces an
already suspected bug, #19798: A user may be allowed to create a table
with CDC enabled, but then is not allowed to read the CDC log just
created. The tests show that the other cases (base tables and views)
do not have this bug, and the creating user does get appropriate
permissions over the new table and views.
In addition to testing auto-grant, the patch also includes tests for
the opposite feature, "auto-revoke" - that permissions are removed when
the table/view/cdc is deleted. If we forget to do that while implementing
auto-grant, we risk that users may be able to use tables created by
other users just because they used the same table _name_ earlier.
It's important to have these auto-revoke tests together with the
auto-grant tests that reproduce #19798 - so we don't forget this
part when finally fixing #19798.
Refs #19798.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19845
Bind variables in CQL have two formats: positional (`?`) where a variable is referred to by its relative position in the statement, and named (`:var`), where the user is expected to supply a name->value mapping.
In 19a6e69001 we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the `dialect` and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
This may be useful to users transitioning from Cassandra, so merits a backport.
Closesscylladb/scylladb#19493
* github.com:scylladb/scylladb:
cql3: add option to not unify bind variables with the same name
cql3: introduce dialect infrastructure
cql3: prepared_statement_cache: drop cache key default constructor
* in the "Backporting Seastar commits" section, there's a single quote
instead of a backtick in this line, so fix it.
* add backticks around `refresh-submodules.sh`, which is a filename.
* correct the command line setting a git config option, because `git-config`
does not support this command line syntax,
```console
$ git config --global diff.conflictstyle = diff3
$ git config --global get diff.conflictstyle
=
$ git config --global diff.conflictstyle diff3
$ git config --global get diff.conflictstyle
diff3
```
quote from git-config(1)
> ```
> git config set [<file-option>] [--type=<type>] [--all] [--value=<value>] [--fixed-value] <name> <value>
> ```
* stop using the deprecated mode of the `git-config` command, and use
subcommand instead. as git-config(1) puts:
> git config <name> <value> [<value-pattern>]
> Replaced by git config set [--value=<pattern>] <name> <value>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20328
Check if podman is available before docker. If it is, use it. Otherwise, check for docker.
1. Podman is better. It runs with fewer resources, and I've had display issues with Docker (output was not shown consistently)
2. 'which docker' works even when the docker service and socket are turned off.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#20342
Triggers the "Build Docs" PR workflow whenever the `db/config.cc` or `db/config.h` files are edited. These files are used to produce documentation, and this change will help prevent the introduction of breaking changes to the documentation build when they are modified.
Closesscylladb/scylladb#20347
for testing the load performance of load_and_stream operation.
Refs #19989
---
no need to backport. it adds two new tests to the existing `perf_sstable` tool for evaluating the load performance when performing the "load_and_streaming" operation. hence has no impact on the production.
Closesscylladb/scylladb#20186
* github.com:scylladb/scylladb:
perf/perf_sstable: add {crawling,partitioned}_streaming modes
test/perf/perf_sstable: use switch-case when appropriate
instead of evaluating the constants in-class, accessing them via
a cached class property.
it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available without
a file being debugged. what's why gdb fails to load the init script:
```
Traceback (most recent call last):
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
class intrusive_slist:
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
size_t = gdb.lookup_type('size_t')
^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```
so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.
the reason is that the script accesses the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.
in this change, we transform all these class variables to cached
properties, so that they
* are evaluated on-demand
* are evaluated only once at most
this addresses the pain at the expense of verbosity.
---
this change intends to improve the developer's user experience, and has no impacts on product, so no need to backport.
Closesscylladb/scylladb#20334
* github.com:scylladb/scylladb:
test/scylla_gdb: test the .gdb init use case
scylla-gdb.py: lazy-evaluate the constants
All users of it have sstable_test_env at hand (in fact -- they call env
method to get table_for_test). And since sstable_test_env already has a
bunch of methods to create sstable, the table_for_test wrapper doesn't
need to duplicate this code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20360
so that we can set this the parameter passed to `-inline-threshold` with
`configure.py` when building with CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20364
before this change, if user does not have `/bin/sh` around, when
installing scylla packages, the script in `%pretrans" is executed,
and fails due to missing `/bin/sh`. per
https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#pretrans
> Note that the %pretrans scriptlet will, in the particular case of
> system installation, run before anything at all has been installed.
> This implies that it cannot have any dependencies at all. For this
> reason, %pretrans is best avoided, but if used it MUST (by necessity)
> be written in Lua. See
> https://rpm-software-management.github.io/rpm/manual/lua.html for more
> information.
but we were trying to warn users upgrading from scylla < 1.7.3, which
was released 7 years ago at the time of writing.
in this change, we drop the `%pretrans` section. hopefuly they will
find their way out if they still exist.
Fixesscylladb/scylladb#20321
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20365
before this change, when running `scylla-housekeeping`:
```
/opt/scylladb/scripts/libexec/scylla-housekeeping:122: SyntaxWarning: invalid escape sequence '\s'
match = re.search(".*http.?://repositories.*/scylladb/([^/\s]+)/.*/([^/\s]+)/scylladb-.*", line)
```
we could have the warning above. because `\s` is not a valid escape
sequence, but the Python interpreter accepts it as two separated
characters of `\s` after complaining. but it's still annoying.
so, let's use a raw string here.
Refs scylladb/scylladb#20317
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20359
before this change, when building the test of `view_build_test` with
clang-20, we can have following build failure:
```
FAILED: test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o -MF test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o.d -o test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o -c /home/kefu/dev/scylladb/test/boost/view_build_test.cc
/home/kefu/dev/scylladb/test/boost/view_build_test.cc:998:5: error: unknown type name 'simple_schema'
998 | simple_schema ss;
| ^
```
apparently, `simple_schema`'s declaration is not available in this
translation unit.
in this change
* we include the header where `simple_schema` is defined, so that
the build passes with clang-20.
* also take this opportunity to reorder the header a little bit,
so the testing headers are grouped together.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20367
in a recent seastar change (644bb662), we do not include
`seastar/testing/random.hh` in `seastar/testing/test_runner.hh` anymore,
as the latter is not a facade of the former, and neither does it use the
former. as a sequence, some tests which take the advantage of the
included `seastar/testing/random.hh` do not build with the latest
seastar:
```
FAILED: test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o
/usr/bin/clang++ -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DBOOST_UNIT_TEST_FRAMEWORK_DYN_LINK -DBOOST_UNIT_TEST_FRAMEWORK_NO_LIB -DDEVEL -DFMT_SHARED -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -DSEASTAR_API_LEVEL=7 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/__w/scylladb/scylladb -I/__w/scylladb/scylladb/build/gen -I/__w/scylladb/scylladb/seastar/include -I/__w/scylladb/scylladb/build/seastar/gen/include -I/__w/scylladb/scylladb/build/seastar/gen/src -I/__w/scylladb/scylladb/build -isystem /__w/scylladb/scylladb/abseil -isystem /__w/scylladb/scylladb/build/rust -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/__w/scylladb/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -MD -MT test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o -MF test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o.d -o test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o -c /__w/scylladb/scylladb/test/lib/key_utils.cc
In file included from /__w/scylladb/scylladb/test/lib/key_utils.cc:11:
/__w/scylladb/scylladb/test/lib/random_utils.hh:25:30: error: no member named 'local_random_engine' in namespace 'seastar::testing'
25 | return seastar::testing::local_random_engine;
| ~~~~~~~~~~~~~~~~~~^
1 error generated.
```
in this change, we include `seastar/testing/random.hh` when the random
facility is used, so that they can be compiled with the latest seastar
library.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20368
under the hood, std::map::count() and std::map::contains() are nearly
identical. both operations search for the given key witin the map.
however, the former finds a equal range with the given
key, and gets the distance between the disntance between the begin
and the end of the range; while the later just searches with the given
key.
since scylla-nodetool is not a performance-critical application, the
minor difference in efficiency between these two operations is unlikely
to have a significant impact on its overall performance.
while std::map::count() is generally suitable for our need, it might be
beneficial to use a more appropriate API.
in this change, we use std::map::contains() in the place of
std::map::count() when checking for the existence of a paramter with
given name.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20350
for better readability
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#20366
* github.com:scylladb/scylladb:
compaction: use std::views::reverse when appropriate
compaction: use structured binding when appropriate
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.
In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
All seeds hostname resolution errors will be ignored during a node
restart in case the node had already joined a cluster. This will
prevent restart errors if some seed names are not resolvable.
Fixesscylladb/scylladb#14945Closesscylladb/scylladb#20292
* github.com:scylladb/scylladb:
Ignore seed name resolution errors on restart.
Add a test for starting with a wrong seed.
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.
The series adds the test for this scenario and the fix for the
chicken&egg problem above.
The series (or at least the fix itself) needs to be backported because
this is a serious regression.
Fixes: scylladb/scylladb#19025Closesscylladb/scylladb#20290
* github.com:scylladb/scylladb:
topology coordinator: fix indentation after the last patch
topology coordinator: do not add replacing node without a ring to topology
test: add test for replace in clusters with encryption enabled
test.py: add server encryption support to cluster manager
.gitignore: fix pattern for resources to match only one specific directory
before this change, we run all the tests in a single pytest session,
with scylladb debug symbols loaded. but we want to test another use
case, where the scylladb debug symbols are missing.
in this change,
* we do not check for the existence of debug symbols until necessary
* add a mark named "without_scylla"
* run the tests in two pytest sessions
- one with "without_scylla" mark
- one with "not without_scylla" mark
* add a test which is marked with the "without_scylla" mark. the test
verify that the scylla-gdb.py script can be loaded even without
scylladb debug symbols.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of evaluating the constants in-class, accessing them via
a cached class property.
it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available with
a file being debugged. so when gdb fails to load init script:
```
Traceback (most recent call last):
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
class intrusive_slist:
File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
size_t = gdb.lookup_type('size_t')
^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```
so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.
the reason is that the script access the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.
in this change, we transform all these class variables to cached
property, so that they
* are evaluated on-demand
* are evaluated only once at most
this addresses the pain at the expense of verbosity.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.
Throw if batchlog manager isn't initialized.
Fixes: #20236.
Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access.
Closesscylladb/scylladb#20251
* github.com:scylladb/scylladb:
test: add test to ensure repair won't fail with uninitialized bm
repair: throw if batchlog manager isn't initialized
This commit replaces the 6.0-to-6.1 upgrade guide with the 6.1-to-6.2 upgrade guide.
The new guide is a template that covers the basic procedure.
If any 6.2-specific updates are required, they will have to be added along with development.
Closesscylladb/scylladb#20178
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` to a coroutine lambda
to improve the readability. However, there was a subtle problem related to
lifetimes of the captures that needed to be addressed:
* Since we started `co_await`ing in the lambda, the captures were at risk of
being destructed too soon. The usual solution is to wrap a coroutine lambda
within a `seastar::coroutine::lambda` object and rely on the extended lifetime
enforced by the semantics of the language.
See `docs/dev/lambda-coroutine-fiasco.md` for more context.
* However, since we don't immediately `co_await` the future returned by
`with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
The document linked in the previous bullet point suggests keeping the passed
coroutine lambda as a variable and pass it as a reference to `with_gate()`.
However, that's not feasible either because we discard the returned future and
the function returns almost instantly -- destructing every local object, which
would encompass the lambda too.
The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.
Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.
In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.
Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
Fixesscylladb/scylladb#20306Closesscylladb/scylladb#20258
* github.com:scylladb/scylladb:
db/hints: Fix indentation in `do_store_hint()`
db/hints: Move code for writing hints to separate function
Add nodetool commands to manage task manager tasks:
- tasks abort - aborts the task
- tasks list - lists all tasks in the module
- tasks modules - lists all modules
- tasks set-ttl - sets task ttl
- tasks status - gets status of the task
- tasks tree - gets statuses of the task and all its desendent's
- tasks ttl - gets task ttl
- tasks wait - waits for the task and gets its status
Fixes: https://github.com/scylladb/scylladb/issues/19201.
Closesscylladb/scylladb#19614
* github.com:scylladb/scylladb:
test: nodetool: add tests for tasks commands
nodetool: tasks: add nodetool commands to track task manager tasks
api: task_manager: return status 403 if a task is not abortable
api: task_manager: return none instead of empty task id
api: task_manager: add timeout to wait_task
api: task_manager: add operation to get ttl
nodetool: add suboperations support
nodetool: change operations_with_func type
nodetool: prepare operation related classes for suboperations
A dialect is a different way to interpret the same CQL statement.
Examples:
- how duplicate bind variable names are handled (later in this series)
- whether `column = NULL` in LWT can return true (as is now) or
whether it always returns NULL (as in SQL)
Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.
The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
~~~
generic_server: convert connection tracking to seastar::gate
If we call server::stop() right after "server" construction, it hangs:
With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,
co_await _all_connections_stopped.get_future();
at the end of server::stop() deadlocks.
Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when
- cserver->start() (sharded<cql_server>::start()) constructs a
"server"-derived object,
- start_listening_on_tcp_sockets() throws an exception before reaching
listen_on_all_shards() (for example because it fails to set up client
encryption -- certificate file is inaccessible etc.),
- the "deferred_action"
cserver->stop().get();
is invoked during cleanup.
(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)
Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.
NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:
- server::shutdown() + server::stop()
- or just server::stop().
Fixes#10305
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~
Fixes#10305.
I think we might want to backport this -- it fixes a hang-on-misconfiguration which affects `scylla-6.1.0-0.20240804.abbf0b24a60c.x86_64` minimally. Basically every release that contains commit ae4d5a60ca has a theoretical chance for the hang, and every release that contains commit 6b178f9a4a has a practical chance for the hang.
Focusing on the more practical symptom (i.e., releases containing commit 6b178f9a4a), `git tag --contains 6b178f9a4a90` gives us (ignoring candidates and release candidates):
- scylla-6.0.0
- scylla-6.0.1
- scylla-6.0.2
- scylla-6.1.0
Closesscylladb/scylladb#20212
* github.com:scylladb/scylladb:
generic_server: make server::stop() idempotent
generic_server: coroutinize server::shutdown()
generic_server: make server::shutdown() idempotent
test/generic_server: add test case
configure, cmake: sort the lists of boost unit tests
generic_server: convert connection tracking to seastar::gate
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.
during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.
Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.
Fixes#20339.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#20029
Handed over from https://github.com/scylladb/scylladb/pull/20149
This adds minimal implementation of the start-restore API call.
The method starts a task that runs load-and-stream functionality against sstables from S3 bucket. Arguments are:
```
endpoint -- the ID in object_store.yaml config file
bucket -- the target bucket to get objects from
keyspace -- the keyspace to work on
table -- the table to work on
snapshot -- the name of the snapshot from which the backup was taken
```
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion.
Remote sstables components are scanned as if they were placed in local upload/ directory. Then colelcted sstables are fed into load-and-stream.
This branch has https://github.com/scylladb/scylladb/pull/19890 (Integrated backup), https://github.com/scylladb/scylladb/pull/20120 (S3 lister) and few more minor PRs merged in. The restore branch itself starts with [utils: Introduce abstract (directory) lister](29c867b54d) commit.
refs: https://github.com/scylladb/scylladb/issues/18392Closesscylladb/scylladb#20305
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add restore integration
test/object_store: Add simple restore test
test/object_store: Generalize prepare_snapshot_for_backup()
code: Introduce restore API method
sstable_loader: Add sstables::storage_manager dependency
sstable_loader: Maintain task manager module
sstable_loader: Out-line constructor
distributed_loader: Split get_sstables_from_upload_dir()
sstables/storage: Compose uploaded sstable path simpler
sstable_directory: Prepare FS lister to scan files on S3
sstable_directory: Parse sstable component without full path
s3-client: Add support for lister::filter
utils: Introduce abstract (directory) lister
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.
The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.
From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.
Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.
The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.
Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.
Zero-token nodes could also be used as load balancers in the
Alternator.
Additionally, this PR fixesscylladb/scylladb#11087, which turned out to
be a blocker.
This PR introduced a new feature. There is no need to backport it.
Fixesscylladb/scylladb#6527Fixesscylladb/scylladb#11087Fixesscylladb/scylladb#15360Closesscylladb/scylladb#19684
* github.com:scylladb/scylladb:
docs: raft: document using zero-token nodes to prevent majority loss
test: test recovery mode in the presence of zero-token nodes
test: topology: util.py: add cqls parameter to check_system_topology_and_cdc_generations_v3_consistency
test: topology: util.py: accept zero tokens in check_system_topology_and_cdc_generations_v3_consistency
treewide: support zero-token nodes in the recovery mode
storage_proxy: make TRUNCATE work locally for local tables
test: topology: util.py: document that check_token_ring_and_group0_consistency fails with zero-token nodes
test: test zero-token nodes
test: test_topology_ops: move helpers to topology/util.py
feature_service: introduce the ZERO_TOKEN_NODES feature
storage_service: rename join_token_ring to join_topology
storage_service: raft_topology_cmd_handler: improve warnings
topology_coordinator: fix indentation after the previous patch
treewide: introduce support for zero-token nodes in Raft topology
system_keyspace: load_topology_state: remove assertion impossible to hit
treewide: distinguish all nodes from all token owners
gossip topology: make a replacing node remove the replaced node from topology
locator: topology: add_or_update_endpoint: use none as the default node state
test: boost: tablets tests: ensure all nodes are normal token owners
token_metadata: rename get_all_endpoints and get_all_ips
network_topology_strategy: reallocate_tablets: remove unused dc_rack_nodes
virtual_tables: cluster_status_table: execute: set dc regardless of the token ownership
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.
Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.
To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.
The patch effectively reverts b8ee8911caFixes: scylladb/scylladb#19025
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` to a coroutine lambda
to improve the readability. However, there was a subtle problem related to
lifetimes of the captures that needed to be addressed:
* Since we started `co_await`ing in the lambda, the captures were at risk of
being destructed too soon. The usual solution is to wrap a coroutine lambda
within a `seastar::coroutine::lambda` object and rely on the extended lifetime
enforced by the semantics of the language.
See `docs/dev/lambda-coroutine-fiasco.md` for more context.
* However, since we don't immediately `co_await` the future returned by
`with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
The document linked in the previous bullet point suggests keeping the passed
coroutine lambda as a variable and pass it as a reference to `with_gate()`.
However, that's not feasible either because we discard the returned future and
the function returns almost instantly -- destructing every local object, which
would encompass the lambda too.
The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.
Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.
In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.
Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
Modify operation and add operation_action class so that information
about suboperations is stored. It's a preparation for adding
suboperations support to nodetool.
before this change, we included `-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.`
in cflags when building the tree with CMake, but this was wrong.
as the "." directory is the build directory used by CMake. and this
directory is specified by the `-B` option when generating the building
system. if `configure.py --use-cmake` is used to build the tree,
the build directory would be "build". so this option instructs the compiler
to replace the directory of source file in the debug symbols and in
`__FILE__` at compile time.
but, in a typical workspace, for instance, `build/main.cc` does not exist.
the reason why this does not apply to CMake but applies to the rules
generated by `configure.py` is that, `configure.py` puts the generated
`build.ninja` right under the top source directory, so `.` is correct and
it helps to create reproducible builds. because this practically erases
the path prefixes in the build output. while CMake puts it under the
specified build directory, replacing the source directory with the build
directory with the file prefix map is just wrong.
there are two options to address this problem:
* stop passing this option. but this would lead to non-reproducible
builds. as we would encode the build directory in the "scylla"
executable. if a developer needs to rebuild an executable for debugging
a coredump generated in production, he/she would have to either
build the tree in the same directory as our CI does. or, he/she
has to pass `-ffile-prefix-map=...` to map the local build directory
to the one used by CI. this is not convenient.
* instead of using `${CMAKE_SOURCE_DIR}=.`, add `${CMAKE_BINARY_DIR}=.`.
this erases the build directory in the outputs, but preserves the
debuggability.
so we pick the second solution.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20329
We modify existing tests to verify that the recovery mode works
correctly in the presence of zero-token nodes.
In `test_topology_recovery_basic`, we test the case when a
zero-token node is live. In particular, we test that the
gossip-based restart of such a node works.
In `test_topology_recovery_after_majority_loss`, we test the case
when zero-token nodes are unrecoverable. In particular, we test
that the gossip-based removenode of such nodes works.
Since zero-token nodes are ignored by the Python driver if it also
connects to other nodes, we use different CQL sessions for a
zero-token node in `test_topology_recovery_basic`.
In the following commit, we modify `test_topology_recovery_basic`
to test the recovery mode in the presence of live zero-token nodes.
Unfortunately, it requires a bit ugly workaround. Zero-token nodes
are ignored by the Python driver if it also connects to other
nodes because of empty tokens in the `system.peers` table. In that
test, we must connect to a zero-token node to enter the recovery
mode and purge the Raft data. Hence, we use different CQL sessions
for different nodes.
In the future, we may change the Python driver behavior and revert
this workaround. Moreover, the recovery tests will be removed or
significantly changed when we implement the manual recovery tool.
Therefore, we shouldn't worry about this workaround too much.
Before we use `check_system_topology_and_cdc_generations_v3_consistency`
in a test with a zero-token node, we must ensure it doesn't fail
because of zero tokens in a row of the `system.topology` table.
Before we implement the manual recovery tool, we must support
zero-token nodes in the recovery mode. This means that two topology
operations involving zero-token nodes must work in the gossip-based
topology:
- removing a dead zero-token node,
- restarting a live zero-token node.
We make changes necessary to make them work in this patch.
In on of the following patches, we implement support for zero-token
nodes in the recovery mode. To achieve this, we need to be able to
purge all Raft data on live zero-token nodes by using TRUNCATE.
Currently, TRUNCATE works the same for all replication strategies - it
is performed on all token owners. However, zero-token nodes are not
token owners, so TRUNCATE would ignore them. Since zero-token nodes
store only local tables, fixing scylladb/scylladb#11087 is the perfect
solution for the issue with zero-token nodes. We do it in this patch.
Fixesscylladb/scylladb#11087
We add tests to verify the basic properties of zero-token nodes.
`test_zero_token_nodes_no_replication` and
`test_not_enough_token_owners` are more or less deterministic tests.
Running them only in the dev mode is sufficient.
`test_zero_token_nodes_topology_ops` is quite slow, as expected,
considering parameterization and the number of topology operations.
In the future we can think of making it faster or skipping in the
debug mode. For now, our priority is to test zero-token nodes
thoroughly.
In one of the following patches, we reuse the helper functions from
`test_topology_ops` in a new test, so we move them to `util.py`.
Also, we add the `cl` parameter to `start_writes`, as the new test
will use `cl=2`.
Zero-token nodes must be supported by all nodes in the cluster.
Otherwise, the non-supporting nodes would crash on some assertion
that assumes only token-owing normal nodes make sense.
Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token
nodes refuse to boot if it is not supported.
I tested this patch manually. First, I booted a node built in the
previous patch. Then, I tried to add a zero-token node built in this
patch. It refused to boot as expected.
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.
The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.
From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.
Topology operations involving zero-token nodes are simplified:
- `add` and `replace` finish in the `join_group0` state, so creating
a new CDC generation and streaming are skipped,
- `removenode` and `decommission` skip streaming,
- `rebuild` does not even contact the topology coordinator as there
is nothing to rebuild,
Also, if the topology operation involves a token-owning node,
zero-token nodes are ignored in streaming.
Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.
The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.
Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.
Zero-token nodes could also be used as load balancers in the
Alternator.
In one of the following patches, we introduce support for zero-token
nodes. From that point, getting all nodes and getting all token
owners isn't equivalent. In this patch, we ensure that we consider
only token owners when we want to consider only token owners (for
example, in the replication logic), and we consider all nodes when
we want to consider all nodes (for example, in the topology logic).
The main purpose of this patch is to make the PR introducing
zero-token nodes easier to review. The patch that introduces
zero-token nodes is already complicated. We don't want trivial
changes from this patch to make noise there.
This patch introduces changes needed for zero-token nodes only in the
Raft-based topology and in the recovery mode. Zero-token nodes are
unsupported in the gossip-based topology outside recovery.
Some functions added to `token_metadata` and `topology` are
inefficient because they compute a new data structure in every call.
They are never called in the hot path, so it's not a serious problem.
Nevertheless, we should improve it somehow. Note that it's not
obvious how to do it because we don't want to make `token_metadata`
store topology-related data. Similarly, we don't want to make
`topology` store token-related data. We can think of an improvement
in a follow-up.
We don't remove unused `topology::get_datacenter_rack_nodes` and
`topology::get_datacenter_nodes`. These function can be useful in the
future. Also, `topology::_dc_nodes` is used internally in `topology`.
In the following patch, we change the gossiper to work the same for
zero-token nodes and token-owning nodes. We replace occurrences of
`is_normal_token_owner` with topology-based conditions. We want to
rely on the invariant that token-owning nodes own tokens if and only
if they are in the normal or leaving state. However, this invariant
is broken by a replacing node because it does not remove the
replaced node from topology. Hence, after joining, the replacing node
has topology with a node that is not a token owner anymore but is
in a leaving state (`being_replaced`). We fix it to prevent the
following patch from introducing a regression.
In one of the following patches, we change the gossiper to work the
same for zero-token nodes and token-owning nodes. We replace
occurrences of `is_normal_token_owner` with topology-based
conditions. We want to rely on the invariant that token-owning nodes
own tokens if and only if they are in the normal or leaving state.
However, this invariant can be broken in the gossip-based topology
when a new node joins the cluster. When a boostrapping node starts
gossiping, other nodes add it to their topology in
`storage_service::on_alive`. Surprisingly, the state of the new node
is set to `normal`, as it's the default value used by
`add_or_update_endpoint`. Later, the state will be set to
`bootstrapping` or `replacing`, and finally it will be set again to
`normal` when the join operation finishes. We fix this strange
behavior by setting the node state to `none` in
`storage_service::on_alive` for nodes not present in the topology.
Note that we must add such nodes to the topology. Other code needs
their Host ID, IP, and location.
We change the default node state from `normal` to `none` in
`add_or_update_endpoint` to prevent bugs like the one in
`storage_service::on_alive`. Also, we ensure that nodes in the `none`
state are ignored in the getters of `locator::topology`.
In one of the following patches, we make NetworkTopologyStrategy
and the tablet load balancer consider only normal token owners to
ensure they ignore zero-token nodes. Some unit tests would start
failing after this change because they do not ensure that all
nodes are normal token owners. This patch prevents it.
Judging by the logic in the test cases in
`network_topology_strategy_test`, `point++` was probably intended
anyway.
In one of the following patches, we introduce support for zero-token
nodes. A zero-token node that has successfully joined the cluster is
in the normal state but is not a normal token owner. Hence, the names
of `get_all_endpoints` and `get_all_ips` become misleading. They
should specify that the functions return only IDs/IPs of token owners.
before this change, memtable serves as the fixture for 6 test cases,
actually these 6 test cases can be categorized into a matrix of 3 x 2:
{ single_row, multi_row, large_partition } x { single_partition, multi_paritition }.
in this change, we break memtable into 3 different fixtures, to reflect
this fact. more readable this way. and a benefit is that each test does
not have to pay for the overhead of setup it does not use at all.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20177
as part of the efforts to address scylladb/scylladb#2717, we are
switching over to the CMake-based building system, and fade out the
mechinary to create the rules manually in `configure.py`.
in this change, we add `--no-use-cmake` to `configure.py`, it serves
two purposes:
* prepare for the change which enables cmake by default, by then,
we would set the default value of `use_cmake` to True, and allow
user to keep using the existing mechinary in the transition period
using `--no-use-cmake`.
* allows the CI to tell if a tree is able to build with CMake.
the command line option of `--use-cmake` is also used by the CI
workflows, and is passed to `configure.py` if `BUILD_WITH_CMAKE`
jenkins pipeline parameter is set. but not all branches with
`--use-cmake` are ready to build with CMake -- only the latest
master HEAD is ready. so the CI needs to check the capability of
building with CMake by looking at the output of `configure.py --help`,
to see if it includes `--no-use-cmake`.
after this change lands. we will remove the `BUILD_WITH_CMAKE`
parameter, and use cmake as long as `configure.py` supports
`--no-use-cmake` option.
the existing mechinary will stay with us for a short transition
period so that developers can take time to get used to the
usage of the naming of targets and the new directory arrangement.
as a side effect, #20079 will be fixed after switching to CMake.
---
this is a cmake-related change, hence no need to backport.
Closesscylladb/scylladb#20261
* github.com:scylladb/scylladb:
build: add --no-use-cmake option to configure.py
build: let configure.py fail if unknown option is passed to it
The test test_streams.py::test_stream_list_tables reproduces a bug where
enabling streams added a spurious result to ListTables. A reviewer of
that patch asked to also add a check that name of the table itself
doesn't disappear from ListTables when a stream is enabled, so this is
what this patch adds.
This theoretical scenario (a table's name disappearing from ListTables)
never happened, so the new check doesn't reproduce any known bug, but
I guess it never hurts to make the test stronger for regression testing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19934
The code tries to build as "neighbors" an unordered_map from an iterator
of std::tuple, instead of the correct std::pair. Apparently, the tuples
are transparently converted to pairs on the newest compilers and the
whole works, but on slightly older compilers (like the one on Fedora 39)
Scylla no longer compiles - the compiler complains it can't convert a
tuple to a pair in this context.
So fix the code to use pairs, not tuples, and it fixes the build on
Fedora 39.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#20319
as we have an API for restore a keyspace / table, let's expose this feature
with nodetool. so we can exercise it without the help of scylla-manager
or 3rd-party tools with a user-friendly interface.
in this change:
* add a new subcommand named "restore" to nodetool
* add test to verify its interaction with the API server
* update the document accordingly.
* the bash completion script is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The test shows how to restore previously backed up table:
- backup
- truncate to get rid of existing sstables
- start restore with the new API method
- wait for the task to finish
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from upload/ directory, the newly
introduced task collects them from S3 bucket and given prefix (that
correspond to the path where backup API method put them).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiper seeds host name resolution failures are ignored during restart if
a node is already boostrapped (i.e. it has successfully joined the cluster).
Fixesscylladb/scylladb#14945
Check whether we can stop a generic server without first asking it to
listen.
The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:
> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
If we call server::stop() right after "server" construction, it hangs:
With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,
co_await _all_connections_stopped.get_future();
at the end of server::stop() deadlocks.
Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when
- cserver->start() (sharded<cql_server>::start()) constructs a
"server"-derived object,
- start_listening_on_tcp_sockets() throws an exception before reaching
listen_on_all_shards() (for example because it fails to set up client
encryption -- certificate file is inaccessible etc.),
- the "deferred_action"
cserver->stop().get();
is invoked during cleanup.
(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)
Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.
NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:
- server::shutdown() + server::stop()
- or just server::stop().
Fixes#10305
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
as part of the efforts to address scylladb/scylladb#2717, we are
switching over to the CMake-based building system, and fade out the
mechinary to create the rules manually in `configure.py`.
in this change, we add `--no-use-cmake` to `configure.py`, it serves
two purposes:
* prepare for the change which enables cmake by default, by then,
we would set the default value of `use_cmake` to True, and allow
user to keep using the existing mechinary in the transition period
using `--no-use-cmake`.
* allows the CI to tell if a tree is able to build with CMake.
the command line option of `--use-cmake` is also used by the CI
workflows, and is passed to `configure.py` if `BUILD_WITH_CMAKE`
jenkins pipeline parameter is set. but not all branches with
`--use-cmake` are ready to build with CMake -- only the latest
master HEAD is ready. so the CI needs to check the capability of
building with CMake by looking at the output of `configure.py --help`,
to see if it includes --no-use-cmake`.
after this change lands. we will remove the `BUILD_WITH_CMAKE`
parameter, and use cmake as long as `configure.py` supports
`--no-use-cmake` option.
the existing mechinary will stay with us for a short transition
period so that developers can take time to get used to the
usage of the naming of targets and the new directory arrangement.
as a side effect, #20079 will be fixed after switching to CMake.
Refs scylladb/scylladb#2717
Refs scylladb/scylladb#20079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this allows us to use `configure.py` to tell if a certain argument is supported
without parsing its output. in the next commit, we will add `--no-use-cmake` option,
which will be used to tell if the tree is ready for using CMake for its building
system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in `configure.py`, a set of options are specified when configuring
seastar, but not all of them were ported to scylla's CMake building
system.
for instance, `configure.py` explicitly disables io_uring reactor
backend at build time, but the CMake-based system does not.
so, in this change, in order to preserve the existing behavior, let's
port the two previously missing option to CMake-based building system
as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20288
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.
Fixes: https://github.com/scylladb/scylladb/issues/18786
Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1
Closesscylladb/scylladb#20109
* github.com:scylladb/scylladb:
test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
replica/mutation_dump: enfore pinning of effective replication map
replica/mutation_dump: handle un-owned tokens (with tablets)
When executing reversed queries, a native revered format shall be used. Therefore, the table schema and the clustering key bounds are reversed before a partition slice and a read command are constructed.
It is, however, possible to run a reversed query passing a table schema but only when there are no restrictions on the clustering keys. In this particular situation, the query returns correct results. Since the current alternator tests in test.py do not imply any restrictions, this situation was not caught during development of https://github.com/scylladb/scylladb/pull/18864.
Hence, additional tests are provided that add clustering keys restrictions when executing reversed queries to capture such errors earlier than in dtests.
Additional manual tests were performed to test a mixed-node cluster (with alternator API enabled in Scylla on each node):
1. 2-node cluster with one node upgraded: reverse read queries performed on an old node
2. 2-node cluster with one node upgraded: reverse read queries performed on a new node
3. 2-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on an old node
4. 2-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on a new node
All reverse read queries above consists of:
- single-partition reverse reads with no clustering key restrictions, with single column restrictions and multi column restrictions both with and without paging turned on
The exact same tests were also performed on a fully upgraded cluster.
Fixes https://github.com/scylladb/scylladb/issues/20191
No backport is required as this is a complementary patch for the series https://github.com/scylladb/scylladb/pull/18864 that did not require backporting.
Closesscylladb/scylladb#20205
* github.com:scylladb/scylladb:
test_query.py: Test reverse queries with clustering key bounds
alternator::do_query Add additional trace log
alternator::do_query: Use native reversed format
alternator::do_query Rename schema with table_schema
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.
This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.
Fixesscylladb/scylladb#20242
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20249
* seastar a7d81328...83e6cdfd (29):
> fair_queue: Export the number of times class was activated
> tests/unit: drop support of C++17
> remove vestigial OSv support
> cmake: undefine _FORTIFY_SOURCE on thread.cc
> container_perf: a benchmark for container perf
> io_sink: use chunked_fifo as _pending_io container
> chunked_fifo: implement clear in terms of pop_n
> chunked_fifo: pop_front_n
> io_sink: use iteration instead of indexing
> json2code_test: choose less popular port number
> ioinfo: add '--max-reqsize' parameter
> treewide: drop the support of fmtlib < 8.0.0
> build: bump up the required fmtlib version to 8.1.1
> conditional-variable: align when() and wait() behaviour in case of a predicate throwing an exception
> stall-analyser: add output support for flamegraph
> reactor: Add --io-completion-notify-ms option
> io_queue: Stall detector
> io_queue: Keep local variable with request execution delay
> io_queue: Rename flow ratio timer to be more generic
> reactor: Export _polls counter (internally)
> dns: de-inline dns_resolver::impl methods
> dns: enter seastar::net namespace
> dnf: drop compatibility for c-ares <= 1.16
> reactor: add missing includes of noncopyable_function.hh
> reactor: Reset one-shot signal to DFL before handling
> future: correctly document nested exception type emitted by finally()
> modules: fix FATAL_ERROR on compiler check
> seastar.cc: include fmt/ranges.h
> pack io_request
Closesscylladb/scylladb#20300
`dist-check` tests the generated rpm packages by installing them in a centos 7 container. but this script is terribly outdated
- centos 7 is deprecated. we should use a new distro's latest stable release.
- cqlsh was added to the family of rpms a while ago. we should test it as well.
- the directory hierarchy has been changed. we should read the artifacts from the new directories.
- cmake uses a different directory hierarchy. we should check the directory used by cmake as well.
to address these breaking changes, the scripts are updated accordingly.
---
this change gives an overhaul to a test, which is not used in production. so no need to backport.
Closesscylladb/scylladb#20267
* github.com:scylladb/scylladb:
tools/testing: add cqlsh rpm
tools/testing: adapt to cmake build directory
tools/testing: test with rockylinux:9 not centos:7
tools/testing: correct the paths to rpm packages and SCYLLA-*-FILE
dist-check: add :z option when mapping volume
The storage_manager maintains set of clients to configured object
storage(s). The sstables loader is going to spawn tasks that will talk
to to those storages, thus it needs the storage manager to get the
clients clients from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need this method to initialize sstable_directory
differently and then do its regular processing. For that, split the
method into two, next patch will re-use the common part it needs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current S3 storage driver keeps sstables in bucket in a form of
/bucket/generation/component-name
To get sstables that are backed up on S3 this format doesn't apply,
because components are uploaded with their names unmodified. This patch
makes S3 storage driver account for that and not re-format component
paths for upload sstable state.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When component lister is created it checks the target storage options
for what kind of lister to create. For local options it creates FS
lister that collects sstables from their component files. For S3
options, it relies on sstables registry.
When collecting sstables from backup, it's not possible to use registry,
because those entries are not there. Instead, lister should pick up
individual components as it they were on local FS. This patch prepares
the lister for that -- in case S3 options are provided and the sstables'
state is "upload", don't try to read those from registry, but
instantiate the FS lister that will later use s3::bucket_lister.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sstable directory collects a entry from storage, it tries to parse
its full path with the help of sstables::parse_path(). There are two
overloads of that function -- one with ks:cf arguments and one without.
The latter tries to "guess" keyspace and table names from the directory
name.
However, ks and table names are already known by the directory, it
doesn't even use the returned ks and cf values, so this parsing is
excessive. Also, future patches will put here backup paths, that might
not match the ks_name/table_name-table_uuid/ pattern that the parser
expects.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Directory lister comes with a filter function that tells lister which
entries to skip by its .get() method. For uniformity, add the same to
S3 bucket_lister.
After this change the lister reports shorter name in the returned
directory entry (with the prefix cut), so also need to tune up the unit
test respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch hides directory_lister and bucket_lister behind a common
facade. The intention is to provide a uniform API for sstable_directory
that it could use to list sstables' components wherever they are.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, when a view update backlog of one replica is full, the write is still sent by the coordinator to all replicas. Because of the backlog, the write fails on the replica, causing inconsistency that needs to be fixed by repair. To avoid these inconsistencies, this patch adds a check on the coordinator for overloaded replicas. As a result, a write may be rejected before being sent to any replicas and later retried by the user, when the replica is no longer overloaded.
This patch does not remove the replica write failures, because we still may reach a full backlog when more view updates are generated after the coordinator check is performed and before the write reaches the replica.
Fixesscylladb/scylladb#17426Closesscylladb/scylladb#18334
* github.com:scylladb/scylladb:
mv: test the view update behavior
mv: add test for admission control
storage_proxy: return overloaded_exception instead of throwing
mv: reject user requests by coordinator when a replica is overloaded by MVs
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.
Batchlog manager cannot be initialized before repair as we have the
dependencies chain:
repair_service -> storage_service::join_cluster -> batchlog_manager.
Throw if batchlog manager isn't initialized. That won't cause repair
to fail.
This series fixes an issue where histogram Summaries return an infinite value.
It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.
The series adds a test for summaries with a specific test case for this scenario.
Fixes#20255
Need backport to 6.0, 6.1 and 2023.1 and above
Closesscylladb/scylladb#20257
* github.com:scylladb/scylladb:
test/estimated_histogram_test Add summary tests
utils/histogram.hh: Make summary support inifinite bucket.
instead of using the default `argparse.HelpFormatter`, let's
use `ArgumentDefaultsHelpFormatter`, so that the default values
of options are displayed in the help messages.
this should help developer understand the behavior of the script
better.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20262
to achieve feature parity with our existing building system, we
need to implement a new build target "dist-check" in the CMake-based
building system.
in this change, "dist-check" is added to CMake-based building system.
unlike the rules generated by `configure.py`, the `dist-check` target
in CMake depends on the dist-*-rpm targets. the goal is to enable user
to test `dist-check` without explicitly building the artifacts being
tested.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20266
in 57def6f1, we specified "package-mode" for poetry, but this option
was introduced in poetry 1.8.0, as the "non-package" mode support.
see https://github.com/python-poetry/poetry/releases/tag/1.8.0
this change practically bumps up the minimum required poetry version
to 1.8.0, we did update `pyproject.tombl` to reflect this change.
but wefailed to update the `Makefile`.
in this change, we update `Makefile` to ensure that user which happens
have an older version of poetry can install the version which supports
this version when running `make setupenv`.
Refs scylladb/scylladb#20284
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20286
we switched from `circular_buffer` to `chunked_fifo` to present
`io_sink::_pending_io` in the latest seastar now. to be prepared for
this change, let's
* add `chunked_fifo` class in `scylla-gdb.py`.
* use `circular_buffer` as a fallback of `chunked_fifo`. instead of
doing this the other way around, we try to send the message that
the latest seastar uses `chunked_fifo`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20280
Add parameter --cluster-pool-size that can control pool size for all
PythonTestSuite tests. By default, the pool size set to 10 for most of
the suites, but this is too much for laptops. So this parameter can be
used to lower the pool size and not to freeze the system. Additionally,
the environment variable CLUSTER_POOL_SIZE was added for a convenient way
to limit pool size in the system without the need to provide each time an
additional parameter.
Related: https://github.com/scylladb/scylladb/pull/20276Closesscylladb/scylladb#20289
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757Closesscylladb/scylladb#19758
* github.com:scylladb/scylladb:
abstract_replication_strategy: make get_ranges async
database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
alternator: ttl: can pass const gms::gossiper& to ranges_holder
alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
alternator: ttl: refactor token_ranges_owned_by_this_shard
Removed people that no longer contribute to the scylladb.git and added/substituted reviewers responsible for maintaining the frontend components.
No need to backport, this is just an information for the github tool.
Closesscylladb/scylladb#20136
* github.com:scylladb/scylladb:
codeowners: add appropriate reviewers to the cluster components
codeowners: add appropriate reviewers to the frontend components
codeowners: fix codeowner names
codeowners: remove non contributors
table_helper has some quite awkward code, improve it a little.
Code cleanup, so no reason to backport.
Closesscylladb/scylladb#20194
* github.com:scylladb/scylladb:
table_helper: insert(): improve indentation
table_helper: coroutinize insert()
table_helper: coroutinize cache_table_info()
table_helper: extract try_prepare()
Currently "table removal" is logged as a reason of compaction stop for table drop,
tablet cleanup and tablet split. Modify log to reflect the reason.
Closesscylladb/scylladb#20042
* github.com:scylladb/scylladb:
test: add test to check compaction stop log
compaction: fix compaction group stop reason
we need to test the installation of cqlsh rpm. also, we should use the
correct paths of the generated rpm packages.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
cmake uses a different arrangement, so let's check for the existence
of the build directory and fallback to cmake's build directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the centos image repos on docker has been deprecated, and the
repo for centos7 has been removed from the main CentOS servers.
so we are either not able to install packages from its default repo,
without using the vault mirror, or no longer to pull its image
from dockerhub.
so, in this change
* we switch over to rockylinux:9, which is the latest stable release
of rockylinux, and rockylinux is a popular clone of RHEL, so it
matches our expectation of a typical use case of scylla.
* use dnf to manage the packages. as dnf is the standard way to manage
rpm packages in modern RPM-based distributions.
* do not install deltarpm.
delta rpms are was not supported since RHEL8, and the `deltarpm` package
is not longer available ever since. see
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html-single/considerations_in_adopting_rhel_8/index#ref_the-deltarpm-functionality-is-no-longer-supported_notable-changes-to-the-yum-stack
as a sequence, this package does not exist in Rockylinux-9.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
when building with the rules generated from `configure.py`,
these files are located under tools' own build directory.
so correct them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
if SELinux is enabled on the host, we'd have following failure when
running `dist-check.sh`:
```
+ podman run -i --rm -v /home/kefu/dev/scylladb:/home/kefu/dev/scylladb docker.io/centos:7 /bin/bash -c 'cd /home/kefu/dev/scylladb && /home/kefu/dev/scylladb/tools/testing/dist-check/docker.io/centos-7.sh --mode debug'
/bin/bash: line 0: cd: /home/kefu/dev/scylladb: Permission denied
```
to address the permission issue, we need to instruct podman to
relabel the shared volume, so that the container can access
the shared volume.
see also https://docs.podman.io/en/stable/markdown/podman-pod-create.1.html#volume-v-source-volume-host-dir-container-dir-options
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, none of the target generated by CMake-based
building system runs `test.py`. but `build.ninja` generated directly
by `configure.py` provides a target named `test`, which runs the
`test.py` with the options passed to `configure.py`.
to be more compatible with the rules generated by `configure.py`,
in this change
* do not include "CTest" module, as we are not using CTest for
driving tests. we use the homebrew `test.py` for this purpose.
more importantly, the target named "test" is provided by "CTest".
so in order to add our own "test" target, we cannot use "CTest"
module.
* add a target named "test" to run "test.py".
* add two CMake options so we can customize the behavior of "test.py",
this is to be compatible with the existing behavior of `configure.py`.
Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20263
This adds minimal implementation of the start-backup API call.
The method starts a task that uploads all files from the given keyspace's snapshot to the requested endpoint/bucket. Arguments are:
- endpoint -- the ID in object_store.yaml config file
- bucket -- the target bucket to put objects into
- keyspace -- the keyspace to work on
- snapshot -- the method assumes that the snapshot had been already taken and only copies sstables from it
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion (hint: it's good to have non-zero TTL value to make sure fast backups don't finish before the caller manages to call wait_task API).
Sstables components are scanned for all tables in the keyspace and are uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.
refs: #18391Closesscylladb/scylladb#19890
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add backup integration
docs: Document the new backup method
test/object_store: Test that backup task is abortable
test/object_store: Add simple backup test
test/object_store: Move format_tuples()
test/pylib: Add more methods to rest client
backup-task: Make it abortable (almost)
code: Introduce backup API method
database: Export parse_table_directory_name() helper
database: Introduce format_table_directory_name() helper
snapshot-ctl: Add config to snapshot_ctl
snapshot-ctl: Add sstables::storage_manager dependency
snapshot-ctl: Maintain task manager module
snapshot-ctl: Add "snapshots" logger
snapshot-ctl: Outline stop() method and constructor
snapshot-ctl: Inline run_snapshot_list<>
test/cql_test_env: Export task manager from cql test env
task_manager: Print task ttl on start (for debugging)
docs: Update object_storage.md with AWS_ environment
docs: Restructure object_storage.md
since the rules generated by `configure.py` has this target, we need
to have an equivalent target as well in CMake-based buidling system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20265
Increase pool size changes were recently reverted because of the flakiness for the test_gossip_boot test. Test started
to fail on adding the node to the cluster without any issues in the Scylla log file. In test logs it looked like the
installation process for the new node just hanged. After investigating the problem, I've found out that the issue is that
test.py was draining the io_executor pool for cleaning the directory during install that was set to eight workers. So
to fix the issue, io_executor pool should be increased to more or less the same ratio as it was: doubled cluster pool size.
Closesscylladb/scylladb#20276
so that we can use {fmt} with it without the help of fmt::streamed.
also since we have a proper formatter for replication_strategy_type,
let's implement
`formatter<vnode_effective_replication_map::factory_key>`
as well.
since there are no callers of these two operator<<, let's drop
them in this change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20248
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.
Prepare for making get_keyspace_local_ranges async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add static `make` methods to ranges_holder_{primary,secondary}
and use them to make the ranges objects and pass them
to `token_ranges_owned_by_this_shard`, rather than letting
token_ranges_owned_by_this_shard invoke the right constructor
of the ranges_holder class.
Prepare for making `make` async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than holding a variant member (and defining
both ranges_holder_{primary,secondary} in both
specilizations of the class, just make the internal
ranges_holder class first-class citizens
and parameterize the `token_ranges_owned_by_this_shard`
template by this class type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
table_helper::cache_table_info() is fairly convoluted. It cannot be
easily coroutinized since it invokes asynchronous functions in a
catch block, which isn't supported in coroutines. To start to break it
down, extract a block try_prepare() from code that is called twice. It's
both a simplification and a first step towards coroutinization.
The new try_prepare() can return three values: `true` if it succeeded,
`false` if it failed and there's the possibility of attempting a fallback,
and an exception on error.
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.
Fixes#20264
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20243
before this change, we use the default options when creating `test_env`,
and the default options enable `use_uuid`. but the modes of
`perf-sstables` involving reads assumes that the identifiers are
deterministic. so that the previously written sstables using the "write"
mode can be read with the modes like "index_read", which just uses
`test_env::make_sstable()` in `load_sstables()`, and under the hood,
`test_env::make_sstable()` uses `test_env::new_generation()` for
retrieving the next identifier of sstable. when using integer-base
identifier, this works. as the sstable identifiers are generated
from a monotonically increasing integer sequence, where the identifiers
are deterministic. but this does not apply anymore when the UUID-based
identifiers are used, as the identifiers are generated with a
pseudorandom generator of UUID v1.
in this change, to avoid relying on the determinism of the integer-based
sstable identifier generation, we enumerate sstables by listing the
given directory, and parse the path for their identifier.
after this change, we are able to support the UUID-based sstable
identifier.
another option is disable the UUID-based sstable identifier when
loading sstables. the upside is that this approach is minimal and
straightforward. but the downside is that it encodes the assumption
in the algorithm implicitly, and could be confusing -- we create
a new generation for loading an existing sstable with this generation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20183
Now the endpoint hanler gets the value from db::config which is not nice
from several perspectives. First, it gets config (ab)using database.
Second, it's compaction manager that "knows" its throughput, global
config is the initial source of that information.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20173
Currently the major compaction task impl grabs this (non-updateable)
value from db::config. That's not good, all services including
compaction manager have their own configs from which they take options.
Said that, this patch puts the said option onto
compaction_manager::config, makes use of it and configures one from
db::config on start (and tests).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20174
This is prerequisite for "restore from object storage" feature. In order to collect the sstables in bucket one would need to list the bucket contents with the given prefix. The ListObjectsV2 provides a way for it and here's the respective s3::client extension.
Closesscylladb/scylladb#20120
* github.com:scylladb/scylladb:
test: Add test for s3::client::bucket_lister
s3_client: Add bucket lister
s3_client: Encode query parameter value for query-string
in 947e2814, we pass `--tty` as long as we are using podman _or_
we are in interactive mode. but if we build the tree using podman
using jenkins, we are seeing that ninja is displaying the output
as if it's in an interactive mode. and the output includes ASCII
escape codes. this is distracting.
the reason is that we
* are using podman, and
* ninja tells if it should displaying with a "smart" terminal by
checking istty() and the "TERM" environmental variable.
so, in this change, we add --tty only if
* we are in the interactive mode.
* or stdin is associated with a terminal. this is the use case
where user uses dbuild to interactively build scylla
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20196
in 30e82a81, we add a contraint to the template parameter of
boost_test_print_type() to prevent it from being matched with
types which can be formatted with operator<<. but it failed to
work. we still have test failure reports like:
```
[Exception] - critical check ['s', 's', 't', '_', 'm', 'r', '.', 'i', 's', '_', 'e', 'n', 'd', '_', 'o', 'f', '_', 's', 't', 'r', 'e', 'a', 'm', '(', ')'] has failed
```
this is not what we expect. the reason is that we passed the template
parameters to the `has_left_shift` trait in the wrong order, see
https://live.boost.org/doc/libs/1_83_0/libs/type_traits/doc/html/boost_typetraits/reference/has_left_shift.html.
we should have passed the lhs of operator<< expression as first
parameter, and rhs the second.
so, in this change, we correct the type constraint by passing the
template parameter in the right order, now the error message looks
better, like:
```
test/boost/mutation_query_test.cc(110): error: in "test_partition_query_is_full": check !partition_slice_builder(*s) .with_range({}) .build() .is_full() has failed
```
it turns out boost::transformed_range<> is formattable with operator<<,
as it fulfills the constraints of `boost::has_left_shift<ostream, R>`,
but when printing it, the compiler fails when it tries to insert the
elements in the range to the output stream.
so, in order to workaround this issue, we add a specialization for
`boost::transformed_range<F, R`.
also, to improve the readability, we reimplement the `has_left_shift<>`
as a concept, so that it's obvious that we need to put both the output
stream as the first parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20233
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.
The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.
Relates to #20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch handles an edge cases related to The infinite bucket
limit.
Summaries are the P50, P95, and P99 quantiles.
The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.
In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;
with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.
Fixes#20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
as we have an API for backup a keyspace, let's expose this feature
with nodetool. so we can exercise it without the help of scylla-manager
or 3rd-party tools with a user-friendly interface.
in this change:
* add a new subcommand named "backup" to nodetool
* add test to verify its interaction with the API server
* add two more route to the REST API mock server, as
the test is using /task_manager/wait_task/{task_id} API.
for the sake of completeness, the route for
/task_manager/{part1} is added as well.
* update the document accordingly.
* the bash completion script is updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
It starts similarly to simpl backup test, but injects a pause into the
task once a single file is scheduled for upload, then aborts the task,
waits for it to fail, and check that _not_ all files are uploaded.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test shows how to backup a keyspace:
- flush
- take snapshot
- start backup with the new API method
- wait for the task to finish
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Namely:
- POST /storage_service/snapshots to take snapshot on a ks
- GET /task_manager/get_task_status/{id} to get status of a running task
- GET /task_manager/wait_task/{id} to wait for a task to finish
- POST /task_manager/abort_task/{id} to abort a running task
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make the impl::is_abortable() return 'yes' and check the impl::_as in
the files listing loop. It's not real abort, since files listing loop is
expected to be fast and most of the time will be spent in s3::client
code reading data from disk and sending them to S3, but client doesn't
support aborting its requests. That's some work yet to be done.
Also add injection for future testing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method starts a task that uploads all files from the given
keyspace's snapshot to the requested endpoint/bucket. The task runs in
the background, its task_id is returned from the method once it's
spawned and it should be used via /task_manager API to track the task
execution and completion (hint: it's good to have non-zero TTL value to
make sure fast backups don't finish before the caller manages to call
wait_task API).
If snapshot doesn't exist, nothing happens (FIXME, need to return back
an error in that case).
If endpoint is not configured locally, the API call resolves with
bad-request instantly.
Sstables components are scanned for all tables in the keyspace and are
uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.
Task is not abortable (FIXME -- to be added) and doesn't really report
its progress other than running/done state (FIXME -- to be added too).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's parse_table_directory_name() static helper in database.cc code
that is used by methods that parse table tree layout for snapshot.
Export this helper for external usage and rename to fit the format_...
one introduced by previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one makes table directory (not full path) out of table name and
uuid. This is to be symmetrical with yet another helper that converts
dirctory name back to table name and uuid (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pretty much all services in Scylla have their own config. Add one to
snapshot-ctl too, it will be populated later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_manager maintains set of clients to configured object
storage(s). The snapshot ctl is going to spawn tasks that will talk to
those storages, thus it needs the storage manager to get the clients
from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper will be used by a code from another .cc file, so the
template needs to be in header for smooth instantiation
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the doc assumes that object storage can only be used to keep
sstables on it. It's going to change, restructure the doc to allow for
more usage scenarios.
Currently it doesn't, one of the node crashes with std::out_of_range
exception and meaningless calltrace
[Botond]: this test checks the case of reading a partition via
MUTATION_FRAGMENTS from a node which doesn't own said partition.
refs: #18786
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.
Since a native reversed format is used for reversed queries,
additional tests with restrictions on clustering keys are required
to capture possible errors like https://github.com/scylladb/scylladb/issues/20191
earlier than in dtests.
Add parametrization to the following tests:
+ test_query_reverse
+ test_query_reverse_paging
to accept a comparison operator used in selection criteria for a Query
operation.
compaction_manager::remove passes "table removal" as a reason
of stopping ongoing compactions, but currently remove method
is also called when a tablet is migrated or split.
Pass the actual reason of compaction stop, so that logs aren't
misleading.
This reverts commit cc428e8a36. It causes
may spurious CI failures while nodes are being torn down. Revert it until
the root cause is fixed, after which it can be reinstated.
Fixes#20116.
Now, when each shard storage_group_manager keeps
only the storage_groups for the tablet replica it owns,
we can simple return the storage_group map size
instead of counting the number of tablet replicas
mapped to this shard.
Add a unit test that sums the tablet count
on all shards and tests that the sum is equal
to the configured default `initial_tablets.
Fixes#18909
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20223
…utations vector
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls.
Fixes#18173Closesscylladb/scylladb#18174
* github.com:scylladb/scylladb:
schema_tables: calculate_schema_digest: filter the key earlier
schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
Additional log prints information on the read query being executed.
It lists information like whether the query is a reversed one or
not, and table_schema and query_schema versions.
When executing reversed queries, a native revered format shall be used.
Therefore the table schema and the clustering key bounds are reversed
before a partition slice and a read command are constructed.
Similarly as for cql3::statements::select_statement.
In order to increase readability, a schema variable is renamed to
a table_schema to emphesize a table schema is passed to the function
and used across it.
Allows us to introduce a query_schema variable in the next patch.
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.
Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.
Analogically for remove_table.
Fixes: #19833.
Closesscylladb/scylladb#20064
io_fiber/store_snapshot_descriptor now gets the actual number of items
preserved when the log is truncated, fixing extra entries remained after
log snapshot creation. Also removes incorrect check for the number of
truncated items in the
raft_sys_table_storage::store_snapshot_descriptor.
Minor change: Added error_injection test API for changing snapshot thresholds settings.
Fixesscylladb/scylladb#16817Fixesscylladb/scylladb#20080Closesscylladb/scylladb#20095
* github.com:scylladb/scylladb:
raft: Ensure const correctness in applier_fiber.
raft: Invoke store_snapshot_descriptor with actually preserved items.
raft: Use raft_server_set_snapshot_thresholds in tests.
raft: Fix indentation in server.cc
raft: Add a test to check log size after truncation.
raft: Add raft_server_set_snapshot_thresholds injection.
utils: Ensure const correctness of injection_handler::get().
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.
Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.
Fixes#16826
Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.
* Requires backport to live versions. Issue hit in the filed with 2022.2.14
Closesscylladb/scylladb#16827
* github.com:scylladb/scylladb:
repair: do_rebuild_replace_with_repair: use source_dc only when safe
repair: replace_with_repair: pass the replace_node downstream
repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
repair: replace_rebuild_with_repair: pass ks_erms from caller
nodetool: rebuild: add force option
Add and use utils::optional_param to pass source_dc
- raft_sys_table_storage::store_snapshot_descriptor now receives a number of
preserved items in the log, rather than _config.snapshot_trailing value;
- Incorrect check for truncated number of items in store_snapshot_descriptor
was removed.
Fixesscylladb/scylladb#16817Fixesscylladb/scylladb#20080
Replace raft_server_snapshot_reduce_threshold with raft_server_set_snapshot_thresholds in tests
as raft_server_set_snapshot_thresholds fully covers the functionality of raft_server_snapshot_reduce_threshold.
before this change, `scylla sstable shard-of` didn't support tablets,
because:
- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges
in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted
This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.
Fixesscylladb/scylladb#16488
---
no need to backport, it's an improvement, not a critical fix.
Closesscylladb/scylladb#20002
* github.com:scylladb/scylladb:
tools: enhance `scylla sstable shard-of` to support tablets
replica/tablets: extract tablet_replica_set_from_cell()
tools: extract get_table_directory() out
tools: extract read_mutation out
build: split the list of source file across multiple line
tools/scylla-sstable: print warning when running shard-of with tablets
The include flag directive now treats missing content as info logs instead of warnings. This prevents build failures when the enterprise-specific content isn't yet available.
If the enterprise content is undefined, the directive automatically loads the open-source content. This ensures the end user has access to some content.
address comments
Closesscylladb/scylladb#19804
in 5ce07e5d84, the target named "storage_proxy.o" was added for
training the build of clang. but the rule for building this target
has two flaws:
* it was added a dependency of the "all" target, but we don't need
to build `storage_proxy.cc` twice when building the tree in the
regular build job. we only need to build it when creating the
profile for training the build of clang.
* it misses the include directory of abseil library. that's why we
have following build failure when building the default target:
```
[2024-08-18T14:58:37.494Z] /usr/local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/jenkins/workspace/scylla-master/scylla-ci/scylla -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/seastar/gen/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/seastar/gen/src -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/gen -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o -MF service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o.d -o service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o -c /jenkins/workspace/scylla-master/scylla-ci/scylla/service/storage_proxy.cc
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/service/storage_proxy.cc:17:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/db/commitlog/commitlog.hh:19:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/db/commitlog/commitlog_entry.hh:15:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/mutation/frozen_mutation.hh:15:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/mutation/mutation_partition_view.hh:16:
[2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/build/gen/idl/mutation.dist.impl.hh:14:
[2024-08-18T14:58:37.495Z] /jenkins/workspace/scylla-master/scylla-ci/scylla/serializer_impl.hh:20:10: fatal error: 'absl/container/btree_set.h' file not found
[2024-08-18T14:58:37.495Z] 20 | #include <absl/container/btree_set.h>
[2024-08-18T14:58:37.495Z] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2024-08-18T14:58:37.495Z] 1 error generated.
```
* if user only enables "dev" mode, we'd have:
```
CMake Error at service/CMakeLists.txt:54 (add_library):
No SOURCES given to target: storage_proxy.o
```
so, in this change, we
* exclude this target from "all"
* link this target against abseil header library, so it has access
to the abseil library. please note, we don't need to build an
executable in this case, so the header would suffice.
* add a proxy target to conditionally enable/disable this target.
as CMake does not support generator expression in `add_dependencies()`
yet at the time of writing.
see https://gitlab.kitware.com/cmake/cmake/-/issues/19467
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20195
~~~
utils/tagged_integer: remove conversion to underlying integer
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.
We could make the conversion operator explicit, for enforcing
static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)
in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
~~~
No backport needed (refactoring).
The series is supposed to solve #20081.
Two patches in the series touch up code that is known to be (orthogonally) buggy; see
- `service/raft_sys_table_storage: tweak dead code` (#20080)
- `test/raft/replication: untag index_t in test_case::get_first_val()` (#20151)
Fixes for those (independent) issues will have to be rebased on this series, or this series will have to be rebased on those (due to context conflicts).
The series builds at every stage. The debug and release unit test suites pass at the end.
Closesscylladb/scylladb#20159
* github.com:scylladb/scylladb:
utils/tagged_integer: remove conversion to underlying integer
test/raft/randomized_nemesis_test: clean up remaining index_t usage
test/raft/randomized_nemesis_test: clean up index_t usage in store_snapshot()
test/raft/replication: clean up remaining index_t usage
test/raft/replication: take an "index_t start_idx" in create_log()
test/raft/replication: untag index_t in test_case::get_first_val()
test/raft/etcd_test: tag index_t and term_t for comparisons and subtractions
test/raft/fsm_test: tag index_t and term_t for comparisons and subtractions
test/raft/helpers: tighten compare_log_entries() param types
service/raft_sys_table_storage: tweak dead code
service/raft_sys_table_storage: simplify (snap.idx - preserve_log_entries)
service/raft_sys_table_storage: untag index_t and term_t for queries
raft/server: clean up index_t usage
raft/tracker: don't drop out of index_t space for subtraction
raft/fsm: clean up index_t and term_t usage
raft/log: clean up index_t usage
db/system_keyspace: promise a tagged integer from increment_and_get_generation()
gms/gossiper: return "strong_ordering" from compare_endpoint_startup()
gms/gossiper: get "int32_t" value of "gms::version_type" explicitly
It is unsafe to restrict the sync nodes for repair to
the source data center if we cannot guarantee a quorum
in the data center with network-topology replication strategy.
This change restricts the usage of source_dc in the following cases:
1. For SimpleStrategy - source_dc is ignored since there is no guarantee
that it contains remaining replicas for all tokens.
2. For EverywhereStrategy - use source_dc if there are remaining
live nodes in the datacenter.
3. For NetworkTopologyStrategy:
a. It is considered unsafe to use source_dc if number of nodes
lost in that DC (replaced/rebuilt node + additional ignored nodes)
is greater than 1, or it has 1 lost node and rf <= 1 in the DC.
b. If the source_dc arg is forced, as with the new
`nodetool rebuild --force <source_dc>` option,
we use it anyway, even if it's considered to be unsafe.
A warning is printed in this case.
c. If the source_dc arg is user-provided, (using nodetool rebuild),
an error exception is thrown, advising to use an alternative dc,
if available, omit source_dc to sync with all nodes, or use the
--force option to use the given source_dc anyhow.
d. Otherwise, we look for an alternative source datacenter,
that has not lost any node. If such datacenter is found
we use it as source_dc for the keyspace, and log a warning.
e. If no alternative dc is found (and source_dc is implicit), then:
log a warning and fall back to using replicas from all nodes in the cluster.
Fixes#16826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,
Refs scylladb/scylladb#6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.
Note that it's already done on the rebuild path
for streaming based rebuild.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used to force usage of source_dc, even
when it is unsafe for rebuild.
Update docs and add test/nodetool/test_rebuild.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.
This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit extracts the information about the default for tables in keyspace creation
to a separate file in the _common folder. The file is then included using
the scylladb_include_flag directive.
The purpose of this commit is to make it possible to include a different file
in the scylla-enterprise repo - with a different default.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585Closesscylladb/scylladb#20181
would be more helpful, if the output of "--help" command line can
include the default value of options.
so, in this change, we include the default values in it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20170
before this change, we check for the existence of "TestSuite" node
under the root of XML tree, and then enumerating all "TestSuite" nodes
under this "TestSuite", this approach works. but it
* introduces unnecessary indent
* is not very readable
in this change, we just use "./TestSuite/TestSuite" for enumerating
all "TestSuite" nodes under "TestSuite". simpler this way.
---
it's a cleanup in the test driver script, hence no need to backport.
Closesscylladb/scylladb#20169
* github.com:scylladb/scylladb:
test.py: fix the indent
test.py: use XPath for iterating in "TestSuite/TestSuite"
Alternator already supports **authentication** - the ability to to sign each request as a particular user. The users that can be used are the different "roles" that are created by CQL "CREATE ROLE" commands. This series adds support for **authorization**, i.e., the ability to determine that only some of these roles are allowed to read or write particular tables, to create new tables, and so on.
The way we chose to do this in this series is to support CQL's existing role-based access control (RBAC) commands - GRANT and REVOKE - on Alternator tables. For example, an Alternator table "xyz" is visible to CQL as "alternator_xyz.xyz", so a `GRANT SELECT ON alternator_xyz.xyz TO myrole` will allow read commands (e.g., GetItem) on that table, and without this GRANT, a GetItem will fail with `AccessDeniedException`.
This series adds the necessary checks to all relevant Alternator operations, and also adds extensive functional testing for this feature - i.e., that certain DynamoDB API operations are not allowed without the appropriate GRANTs.
The following permissions are needed for the following Alternator API operations:
* **SELECT**: `GetItem`, `Query`, `Scan`, `BatchGetItem`, `GetRecords`
* **MODIFY**: `PutItem`, `DeleteItem`, `UpdateItem`, `BatchWriteItem`
* **CREATE**: `CreateTable`
* **DROP**: `DeleteTable`
* **ALTER**: `UpdateTable`, `TagResource`, `UntagResource`, `UpdateTimeToLive`
* _none needed_: `ListTables`, `DescribeTable`, `DescribeEndpoints`, `ListTagsOfResource`, `DescribeTimeToLive`, `DescribeContinuousBackups`, `ListStreams`, `DescribeStream`, `GetShardIterator`
Currently, I decided that for consistency each operation requires one permission only. For example, PutItem only requires MODIFY permission. This is despite the fact that in some cases (namely, `ReturnValues=ALL_OLD`) it can also _read_ the item. We should perhaps discuss this decision - and compare how it was done in CQL - e.g., what happens in LWT writes that may return old values?
Different permissions can be granted for a base table, each of its views, and the CDC table (Alternator streams). This adds power - e.g., we can allow a role to read only a view but not the base table, or read the table but not its history. GRANTing permissions on views or CDC logs require knowing their names, which are somewhat ugly (e.g., the name of GSI "abc" in table "xyz" is `alternator_xyz.xyz:abc`). But usefully, the error message when permissions are denied contains the full name of the table that was lacking permissions and which permissions were lacking, so users can easily add them.
In addition to permissions checking, this series also correctly supports _auto-grant_ (except #19798): When a role has permissions to `CreateTable`, any table it creates will automatically be granted all permissions for this role, so this role will be able to use the new table and eventually delete it. `DeleteTable` does the opposite - it removes permissions from tables being deleted, so that if later a second user re-creates a table with the same name, the first user will not have permissions over the new table.
The already-existing configuration parameter `alternator_enforce_authorization` (off by default), which previously only enabled authentication, now also enables authorization. Users that upgrade to the new version and already had `alternator_enforce_authorization=true` should verify that the users they use to authenticate either have the appropriate permissions or the "superuser" flag. Roles used to authenticate must also have the "login" flag.
Please note that although the new RBAC support implements the access control feature we asked for in #5047, this implementation is _not compatible_ with DynamoDB. In DynamoDB, the access control is configured through IAM operations or through the new `PutResourcePolicy` - operation, not through CQL (obviously!). DynamoDB also offers finer access-control granularity than we support (Scylla's RBAC works on entire tables, DynamoDB allows setting permissions on key prefixes, on individual attributes, and more). Despite this non-compatibility, I believe this feature, as is, will already be useful to Alternator users.
Fixes#5047 (after closing that issue, a new clean issue should be opened about the DynamoDB-compatible APIs that we didn't do - just so we remember this wasn't done yet).
New feature, should not be backported.
Closesscylladb/scylladb#20135
* github.com:scylladb/scylladb:
tests: disable test_alternator_enforce_authorization_true
test, alternator: test for alternator_enforce_authorization config
test/pylib: allow setting driver_connect() options in servers_add()
test: fix test_localnodes_joining_nodes
alternator, RBAC: reproducer for missing CDC auto-grant
alternator: document the new RBAC support
alternator: add RBAC enforcement to GetRecords
test/alternator: additional tests for RBAC
test/alternator: reduce permissions-validity-in-ms
test/alternator: add test for BatchGetItem from multiple tables
alternator: test for operations that do not need any permissions
alternator: add RBAC enforcement to UpdateTimeToLive
alternator: add RBAC enforcement to TagResource and UntagResource
alternator: add RBAC enforcement to BatchGetItem
alternator: add RBAC enforcement to BatchWriteItem
alternator: add RBAC enforcement to UpdateTable
alternator: add RBAC enforcement to Query and Scan
alternator: add RBAC enforcement to CreateTable
alternator: add RBAC enforcement to DeleteTable
alternator: add RBAC enforcement to UpdateItem
alternator: add RBAC enforcement to DeleteItem
alternator: add RBAC enforcement to PutItem
alternator: add RBAC enforcement to GetItem
alternator: stop using an "internal" client_state
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
**Please replace this line with justification for the backport/\* labels added to this PR**
Closesscylladb/scylladb#19427
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
The test is flaky and needs to be fixed in order to not randomly break
our CI, OTOH can be commented out for the time being, so that we can
marge the feature.
This patch adds tests that demonstrates the current way that Alternator's
authentication and authorization are both enabled or disabled by the
option "alternator_enforce_authorization".
If in the future we decide to change this option or eliminate it (e.g.,
remain just with the "authenticator" and "authorizer" options), we can
easily update these tests to fit the new configuration parameters and
check they work as expected.
Because the new tests want to start Scylla instances with different
configuration parameters, they are written in the the "topology"
framework and not in the test/alternator framework. The test/alternator
framework still contains (test/alternator/test_cql_rbac.py) the vast
majority of the functional testing of the RBAC feature where all those
tests just assume that RBAC is enabled and needs to be tested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The manager.driver_connect() functions allows to pass parameters when
creating the connection (e.g., a special auth_provider), but unfortunately
right now the servers_add() function always calls driver_connect()
without parameters. So in this patch we just add a new optional
parameter to servers_add(), driver_connect_opts, that will be passed to
driver_connect().
In theory instead of the new option to driver_connect() a caller can
pass start=False to servers_add() and later call driver_connect()
manually with the right arguments. The problem is that start=False
avoids more than just calling driver_connect(), so it doesn't solve
the problem.
An example of using the new option is to run Scylla with authentication
enabled, and then connect to it using the correct default account
("cassandra"/"cassandra"):
config = {
'authenticator': 'PasswordAuthenticator',
'authorizer': 'CassandraAuthorizer'
}
servers = await manager.servers_add(1, config=config,
driver_connect_opts={'auth_provider':
PlainTextAuthProvider(username='cassandra', password='cassandra')})
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing test
topology_experimental_raft/test_alternator::test_localnodes_joining_nodes
Tried to create a second server but *not* wait for it to complete, but
the trick it used (cancelling the task) doesn't work since commit 2ee063c
makes a list of unwaited tasks and waits for them anyway. The test
*appears* to work because it is the last test in the file, but if we
ever add another test in the same file (like I plan to do in the next
patch), that other test will find a "BROKEN" ScyllaClusterManager and
report that it failed :-(
Other tricks I tried to use (like killing the servers) also didn't work
because of various limitations and complications of the test framework
and all its layers.
So not wanting to fight the fragile testing framework any more at this
point, I just gave up and the test will *wait* for the second server
to come up. This adds 120 seconds (!) to the test, but since this whole
test file already takes more than 500 seconds to complete, let's bite
this bullet. Maybe in the future when the test framework improves, we can
avoid this 120 second wait.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducing (xfailing) test for issue #19798, which
shows that if a role is able to create an Alternator table, the role is
able to read the new table (this is known as "auto-grant"), but is NOT
able to read the CDC log (i.e., use Alternator Streams' "GetRecords").
Once we do fix this auto-grant bug, it's also important to also implement
auto-revoke - the permissions on a deleted table must be deleted as well
(otherwise the old owner of a deleted table will be able to read a new
table with the same name). This patch also adds a test verifying that
auto-revoke works. This test currently passes (because there is no auto-
grant, so nothing needs to be revoked...) but if we'll implement auto-grant
and forget auto-revoke, the second test will start to fail - so I added
this test as a precaution against a bad fix.
Refs #19798
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In docs/alternator/compatibility.md we said that although Alternator
supports authentication, it doesn't support authorization (access
control). Now it does, so the relevant text needs to be corrected
to fit what we have today.
It's still in the compatibility.md document because it's not the same
API as DynamoDB's, so users with existing applications may need to be
aware of this difference.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "SELECT" permission on a table to
run a GetRecords on it (the DynamoDB Streams API, i.e., CDC).
The grant is checked on the *CDC log table* - not on the base table,
which allows giving a role the ability to read the base but not is
change stream, or vice versa.
The operations ListStreams, DescribeStreams, GetShardIterators do not
require any permissions to run - they do not read any data, and are
(in my opinion) similar in spirit to DescribeTable, so I think it's fine
not to require any permissions for them.
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Additional tests for support for CQL Role-Based Access Control (RBAC)
in Alternator:
1. Check that even in an Alternator table whose name isn't valid as CQL
table names (e.g., uses the dot character) the GRANT/REVOKE commands
work as expected.
2. Check that superuser roles have full permissions, as expected.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We set in test/cql-pytest/run.py, affecting test/alternator/run, the
configuration permissions_validity_in_ms by default to 100ms. This means
that tests that need to check how GRANT or REVOKE work always need to
sleep for more than 100ms, which can make a test with a lot of these
operations very slow.
So let's just set this configuration value to 5ms. I checked that it
doesn't adversely affect the total running speed of test/alternator/run.
This change only affects running tests through test/alternator/run, which
is expected to be fast. I left the default for test.py as it was, 100ms,
the latency of individual tests is less important there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
While working on the RBAC on BatchGetItem, I noticed that although
BatchGetItem may ask to read items from several tables, we don't have
a test covering this case! This patch fixes that testing oversight.
Note that for the write-side version of this operation, BatchWriteItem,
we do have tests that write to several tables in the same batch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Some operations, namely ListTables, DescribeTable, DescribeEndpoints,
ListTagsOfResource, DescribeTimeToLive and DescribeContinuousBackups
do not need any permissions to be GRANTed to a role.
Our rationale for this decision is that in CQL, "describe table" and
friends also do not require any permissions.
This patch includes a test that verifies that they really don't need
permissions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "ALTER" permission on a table to
run a UpdateTimeToLive on it. UpdateTimeToLive is similar in purpose to
UpdateTable, so it makes sense to use the same permission "ALTER" as we
do for UpdateTable.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "ALTER" permission on a table to
run the TagResource or UntagResource operations on it. CQL does not
have an exact parallel of DynamoDB's tagging feature, but our usual
use of tags as an extension of UpdateTable to change non-standard options
(e.g., write isolation policy or tablets setup), so it makes sense to
require the same permissions we require for UpdateTable - namely "ALTER".
A test for both operations is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "SELECT" permission on a table to
run a BatchGetItem on it. A single batch may ask to write to several
different tables, so we fail the entire batch with AccessDeniedException
if any of the tables mentioned in the batch do not have SELECT permissions
for this role.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a BatchWriteItem on it. A single batch may ask to write to several
different tables, so we fail the entire batch with AccessDeniedException
if any of the tables mentioned in the batch do not have MODIFY permissions
for this role.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "ALTER" permission on a table to
run a UpdateTable on it.
A tests is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "SELECT" permission on a table to
run a Query or Scan on it.
Both Query and Scan operations call the same do_query() function, so the
permission checks are put there.
Note that Query can read from either the base table or one of its views,
and the permissions on the base and each of the views can be separate
(so we can allow a role to only read one view, for example).
Tests for all of the above are also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "CREATE" permission on ALL
KEYSPACES to run a CreateTable operation.
The CreateTable operation also performs so-called "auto-grant": When a
role creates a table, it is automatically granted full permissions to
read, write, change or delete that new table.
A test for all these things is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "DROP" permission on a table to
run a DeleteTable on it.
Moreover, when a table and its views are deleted, any special permissions
previously GRANTed on this table are removed. This is necessary because
if a role creates a table it is automatically granted permissions on this
table (this is known as "auto-grant" - see the CreateTable patch for
details). If this role deletes this table and later a second role creates
a table with the same name, we don't want the first role to have
permissions on this new table.
Tests for permission enforcements and revocation on delete are also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a UpdateItem on it.
Only the MODIFY permission is required, even if the operation may also
read the old value of the item, such as a read-modify-write operation
or even using ReturnValues='ALL_OLD'.
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a DeleteItem on it.
Only the MODIFY permission is required, even if the operation may also
read the old value of the item (using ReturnValues='ALL_OLD').
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a requirement for the "MODIFY" permission on a table to
run a PutItem on it.
Only the MODIFY permission is required, even if the operation may also
read the old value of the item (using ReturnValues='ALL_OLD').
A test is also added.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch, we begin to add role-based access control (RBAC)
enforement to Alternator - in this patch only to GetItem.
After the preparation of client_state correctly in the previous patch,
the permission check itself in the get_item() function is very simple.
The bigger part of this patch is a full functional test in
test/alternator/test_cql_rbac.py. The test is quite self-explanatory
and heavily commented. Basically we check that a new role cannot
read with GetItem a pre-existing table, and we can add that ability
by GRANTing (in CQL) the new role the ability to SELECT the table,
the keyspace, all keyspaces, or add that ability to some other role
that this role inherits.
In the following patches, we will add role-based access control to
the Alternator operations, but the functional tests will be shorter -
we don't need to check the role inheritence, "all keyspaces" feature,
and so on, for every operation separately since they all use the
same underlying checking functions which handles these role inheritence
issues in exactly the same way.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Scylla uses a "client_state" object to encapsulate the information of
who the client is - its IP address, which user was authenticated, and so on.
For an unknown reason, Alternator created for each request an "internal"
client_state, meaning that supposedly the client for each request was
some sort of internal process (e.g., repair) rather than a real client.
This was wrong, and we even had a FIXME about not putting the client's
IP address in client_state.
So in this patch, we start using a normal "external" client_state
instead of an "internal" one. The client_state constructors are very
different in the two cases, so a few lines of code had to change.
I hope that this change will cause no functional changes. For example,
Alternator was already setting its own timeouts explicitly and not
relying on the default ones for external clients. However, we need to
fix this for the following patches which introduce permissions checks
(Role-Based Access Control - RBAC) - the client_state methods for
checking permissions become no-ops for *internal* clients (even if the
client_state contains an authenticated users). We need these functions
to do their job - so we need an *external* variant of client_state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Commit 9f93dd9fa3 changed
`tablet_sstable_set::_sstable_sets` to be a `absl::flat_hash_map` and
in addition, `std::set<size_t> _sstable_set_ids` was
added. `_sstable_set_ids` is set up in the
`tablet_sstable_set(schema_ptr s, const storage_group_manager& sgm,
const locator::tablet_map& tmap)` constructor, but it is not copied in
`tablet_sstable_set(const tablet_sstable_set& o)`.
This affects the `tablet_sstable_set::tablet_sstable_set` method as it
depends on the copy constructor. Since sstable set can be cloned when
a new sstable set is added, the issue will cause ids not being copied
into the new sstable set. It's healed only after compaction, since the
sstable set is rebuilt from scratch there.
This PR fixes this issue by removing the existing copy constructor of
`tablet_sstable_set` to enable the implicit default copy constructor.
Fixes#19519Closesscylladb/scylladb#20115
* github.com:scylladb/scylladb:
boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
replica: fix copy constructor of tablet_sstable_set
This patch adds metrics for batch get_item and batch write_item.
The new metrics record summary and histogram for latencies and batch size.
Batch sizes are implemented as ever-growing counters. To get the average batch size divide the rate of
the batch size counter by the rate of the number of batch counter:
```rate(batch_get_item_batch_size)/rate(batch_get_item)```
Relates to #17615
New code, No need to backport
Closesscylladb/scylladb#20190
* github.com:scylladb/scylladb:
Add tests for Alternator batch operation metrics
alternator/executor: support batch latency and size metrics
Add metrics for Alternator get and write batch operations
This patch adds unit tests to verify the correctness of the newly
introduced histogram metrics for get and write batch operation
latencies.
The test uses the existing latency test with the added metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch Updated the get and write batch operations in Alternator to
record latency using the newly added histogram metrics.
It adds logic to increment the counters with the number of items
processed in each batch.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Introduced histogram metrics to track latency for Alternator's get and
write batch operations.
Added counters to record the number of items processed in each batch
operation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.
Fixes#19519
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
this change is a follow up of 06c60f6ab, which updated the
2nd step of the test to use switch-case, but missed the 1st step.
so this change updates the first step of the test as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we check for the existence of "TestSuite" node
under the root of XML tree, and then enumerating all "TestSuite" nodes
under this "TestSuite", this approach works. but it
* introduces unnecessary indent
* is not very readable
in this change, we just use "./TestSuite/TestSuite" for enumerating
all "TestSuite" nodes under "TestSuite". simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The method cannot find the TestSuite in the XML file and fails the whole job, however tests are passed. The issue was in incorrect understanding of boost summarization method. It creates one file for all modes, so there is no need to go through all modes to convert the XML file for allure.
Closes: https://github.com/scylladb/scylladb/issues/20161Closesscylladb/scylladb#20165
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.
This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.
Fixes#18173
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, `scylla sstable shard-of` didn't support tablets,
because:
- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges
in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted
This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.
Fixesscylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the `get_table_directory()` function will have applications
beyond its current use in `schema_loader.cc`. its ability to locate
the directory storing the sstables of given table could be valuable
in other subcommand(s) implementation.
so, in this change we extract it out into a dedicated source file,
so that it accept the primary_key and an optional clustering_key.
Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the `read_mutation_from_table_offline()` function will have applications
beyond its current use in `schema_loader.cc`. its ability to parser
mutation data from sstables could be valuable in other subcommand(s)
implementation.
so, in this change we extract it out into a dedicated source file,
so that it accept the primary_key and an optional clustering_key.
Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Split the extended list of source files across multiple lines.
This improves readability and makes future additions easier to
review in diffs.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the subcommand of "shard-of" does not support tablets yet. so let's
print out an error message, instead of printing the mapping assuming
that the sstables are distributed based on token only.
this commit also adds two more command line options to this subcommand,
so that user is required to specify either "--vnodes" or "--tablets"
to instruct the tool how the cluster distributes the tokens across nodes
and their shards. this helps to minimize the suprise of user.
this change prepares for the succeeding changes to implement the tablets
support.
the corresponding test is updated accordingly so that it only exercises
the "shard-of" subcommand without tablets. we will test it with tablets
enabled in a succeeding change.
Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.
We could make the conversion operator explicit, for enforcing
static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)
in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "test/raft/randomized_nemesis_test.cc":
- addition of tagged and untagged (both should be tagged)
- taking the minimum of an index difference and a container size (both
should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
unpack and clean up the relatively complex
first_to_remain = max(snap.idx + 1 - preserve_log_entries, 0)
calculation in persistence::store_snapshot().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag the operands / arguments of the following operations, in
"test/raft/replication.hh":
- assignment to raft_cluster::_seen
- call to hasher_int::hash_range()
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
raft_cluster::get_states() passes a "start_idx" to create_log(), and
create_log() uses it as an "index_t" object. Match the type of "start_idx"
to its name.
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In test_case::get_first_val(), the asssignment
first_val = initial_snapshots[initial_leader].snap.idx;
*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20151>.
For now, wean the buggy asssignment off the disappearing
tagged-to-untaggged conversion.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL() and BOOST_CHECK().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL(), BOOST_CHECK(). Clean up the first args of
read_quorum() calls -- stay in term_t space.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The "from" and "to" parameters of compare_log_entries() are raft log
indices; change them to raft::index_t, and update the callers.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In raft_sys_table_storage::store_snapshot_descriptor(), the condition
preserve_log_entries > snap.idx
*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20080>.
Ticket#20080 explains that this condition always evaluates to false in
practice, and that the "else" branch handles all cases correctly anyway.
For now, wean the buggy expression off the disappearing
tagged-to-untaggged conversion.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Currently, boost tests aren't using Junit. Enable Junit report output and clean them from skipped test, since boost tests are executed by function name rather than filename. This allows including boost tests result to the Allure report.
Related: https://github.com/scylladb/qa-tasks/issues/1665Closesscylladb/scylladb#19925
before this change, we assume user runs nodetool tests right under the root source directory. if user runs them under `test/nodetool`, the suppression rules are not applied. as the path is incorrect in that case.
after this change, the supression rules' path is deduced from the top src directory. so we can now run the nodetool test under `test/nodetool` .
---
no need to backport, this change improves developer's experience.
Closesscylladb/scylladb#20119
* github.com:scylladb/scylladb:
test/nodetool: deduce subpression path from top srcdir
test/nodetool: deduce path from top srcdir
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.
Closesscylladb/scylladb#20122
`tools/toolchain/optimized_clang.sh` builds this target for creating
the profile in order to build clang optimized with this profile data.
so let's be compatible with `configure.py`, and add this target to
CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20105
The lock_table() method needs database, ks and cf to find the table on
all shards. The same can be achieved with the help of global_table_ptr
thing that all the core callers already have at hand.
There's a test that doesn't have global table, but it can get one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20139
Typically the sstable_directory is constructed out of a table object.
Some code, namely tests and schema-loader, don't have table at hand and
construct directory out of schema, sharder, path-to-sstables, etc. This
code doesn't work with any storage options other than local ones, so
there's no need (yet) to carry this argument over.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20138
before this change, we look up for the mode using the command line
option as the key, but that's incorrect if the command line option does
not match with any of the known names. in that case, `test_mode` just
create another pair of <sstring, test_modes>, and return the second
component of this pair. and the second component is not what we expect.
we should have thrown an exception.
in this change
* the test_mode map is marked const.
* the overloads for parsing / formatting the `test_modes` type are
added, so that boost::program_options can parse and format it.
after this change, we print more user friendly error, like
```
/scylla perf-sstable --mode index-foo
error: the argument ('index-foo') for option '--mode' is invalid
Try --help.
```
instead of a bunch of output which is printed as if we passes the correct option as the argument of the `--mode` option.
---
it's an improvement of developer experience, hence no need to backport.
Closesscylladb/scylladb#20140
* github.com:scylladb/scylladb:
test/perf/perf_sstable: use switch-case when appropriate
test/perf/perf_sstables: use test_modes as the type of its option
This change fixes#17237, fixes#5361 and fixes#5362 by passing the limit value down the call chain in cql3. A test is also added.
fixes#17237fixes#5361fixes#5362
The regression happened in 5.4 as we changed the way GROUP BY is processed in 432cb02 - to force aggregation when it is used. The LIMIT value was not passed to aggregations and thus we failed to adhere to it.
W want to backport this fix to 5.4 and 6.0 to have continuous correct results for the test case from #17237
This patch consists of 4 commits:
- fa4225ea0fac2057b7a9976f57dc06bcbd900cd4 - cql3: respect the user-defined page size in aggregate queries - a precondition for this patch to be implementable
- 8fbe69e74dca16ed8832d9a90489ca47ba271d0b - cql3/select_statement: simplify the get_limit function - the `do_get_limit()` function did a lot of legwork that should not be associated with it. This change makes it trivial and makes its callers do additional checks (for unset guards, or for an aggregate query)
- 162828194a2b88c22fbee335894ff045dcc943c9 - cql3: process LIMIT for GROUP BY queries - pass the limit value down the chain and make use of it. This is the actual fix to #17237
- b3dc6de6d6cda8f5c09b01463bb52f827a6a00b4 - test/cql-pytest: Add test for GROUP BY queries with LIMIT - tests
Closesscylladb/scylladb#18842
* github.com:scylladb/scylladb:
test/cql-pytest: Add test for GROUP BY queries with LIMIT
cql3: process LIMIT for GROUP BY queries
cql3/select_statement: simplify the get_limit function
cql3: respect the user-defined page size in aggregate queries
Drop half-reversed (legacy) format of query::partition_slice.
The select query builds a fully reversed (native) slice for reversed queries and use it together with a reversed
schema to construct query::read_command that is further propagated to the database.
A cluster feature is added to support nodes that still operate on half-reversed slices. When the feature is turned off:
- query::read_command is transformed (to have table schema and half-reversed slices) before sending to other nodes
- query::read_command is transformed (to have query schema (reversed) and reversed slices) after receiving it from other nodes
- Similarly, mutations are transformed. They are reversed before being sent to other nodes or after receiving them from other nodes.
Additional manual tests were performed to test a mixed-node cluster:
1. 3-node cluster with one node upgraded: reverse read queries performed on an old node
2. 3-node cluster with one node upgraded: reverse read queries performed on a new node
3. 3-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on an old node
4. 3-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on a new node
All reverse read queries above consists of:
- single-partition reverse reads with no clustering key restrictions, with single column restrictions and multi column restrictions both with and without paging turned on
- multi-partition reverse reads with range restrictions with optional partition limit and partial ordering
The exact same tests were also performed on a fully upgraded cluster.
Fixes https://github.com/scylladb/scylladb/issues/12557Closesscylladb/scylladb#18864
* github.com:scylladb/scylladb:
mutation_partition: drop reverse parameter in compact_for_query
clustering_key_filter: unify get_ranges and get_native_ranges
streamed_mutation_freezer: drop the reverse parameter
reverse-reads.md: Drop legacy reverse format information
Fix comments refering to half-reversed (legacy) slices
select_statement::do_execute: Add tracing informaction
query::trim_clustering_row_ranges_to: require reversed schema for native reversed ranges
query-request: Drop half_reverse_slice as it is no longer used anywhere
readers: Use reversed schema and native reversed slices
database: accept reversed schema for reversed queries
storage_proxy: Support reverse queries in native format
query_pagers: Replace _schema with _query_schema
query_pagers: Support reverse queries in native format
select_statement: Execute reversed query in native format
storage_proxy::remote: Add support for mixed-node clusters
mutation_query: Add reversed function to reverse reconcilable_result
query-request: Add reversed function to reverse read_command
features: add native_reverse_queries
kl::reader::make_reader: Unify interface with mx::reader::make_reader
config: drop reversed_reads_auto_bypass_cache
config: drop enable_optimized_reversed_reads
This commit updates the configuration for ScyllaDB documentation so that:
- 6.1 is the latest version.
- 6.1 is removed from the list of unstable versions.
It must be merged when ScyllaDB 6.1 is released.
No backport is required.
Closesscylladb/scylladb#20041
With conversion of tagged integers to untagged ones going away, replace
static_cast<uint64_t>(snap.idx)
with
snap.idx.value()
Furthermore, casting "preserve_log_entries" (of type "size_t") to
"uint64_t" is redundant (both "snap.idx" and "preserve_log_entries" carry
nonnegative values, and the mathematical difference is expected to be
nonnegative); remove the cast.
Finally, simplify the initialization syntax.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag index_t and term_t values in the following two contexts:
- when they are passed to CQL queries as int64_t,
- when they are default-constructed as fallbacks for int64_t fields
missing from CQL result sets.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/server.cc":
- addition of tagged and untagged (both should be tagged)
- subscripting an array by tagged (should be untagged)
- comparing a size-like threshold against tagged (should be untagged)
- exposing tagged via gauges (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/fsm.cc":
- addition of tagged and untagged (both should be tagged)
- comparison (relop) between tagged an untagged (both should be tagged)
- subscripting or sizing an array by tagged (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in raft/log.{cc,h}:
- addition of tagged and untagged (both should be tagged)
- comparison (relop) between tagged an untagged (both should be tagged)
- subscripting an array, or offsetting an iterator, by tagged (should be
untagged)
- comparing an array bound against tagged (should be untagged)
- subtracting tagged from an array bound (should be untagged)
Note: these files mix uniform initialization syntax (index_t{...}) with
constructor call syntax (index_t()), with the former being more frequent.
Stick with the former here too, for consistency.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Internally, increment_and_get_generation() produces a
"gms::generation_type" value.
In turn, all callers of increment_and_get_generation() -- namely
scylla_main() [main.cc] and single_node_cql_env::run_in_thread()
[test/lib/cql_test_env.cc] -- pass the resolved value to
storage_service::init_address_map() and storage_service::join_cluster(),
both of which take a "gms::generation_type".
Therefore it is pointless to "untag" the generation value temporarily
between the producer and the consumers. Correct the return type of
increment_and_get_generation().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The callers of gossiper::compare_endpoint_startup() need not (should not)
learn of any particular (tagged or untagged) difference of generations;
they only care about the ordering of generations. Change the return type
of compare_endpoint_startup() to "std::strong_ordering", and delegate the
comparison to tagged_tagged_integer::operator<=>.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In do_sort(), we need to drop to "int32_t" temporarily, so that we can
call ::abs() on the version difference. Do that explicitly.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.
Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.
Fixesscylladb/scylla-enterprise#4380Closesscylladb/scylladb#20089
instead of using a chain of `if-else`, use switch-case instead,
it's visually easier to follow than `if`-`else` blocks. and since
we never need to handle the `else` case, the `throw` statement
is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we look up for the mode using the command line
option as the key, but that's incorrect if the command line option does
not match with any of the known names. in that case, `test_mode` just
create another pair of <sstring, test_modes>, and return the second
component of this pair. and the second component is not what we expect.
we should have thrown an exception.
in this change
* the test_mode map is marked const.
* the overloads for parsing / formatting the `test_modes` type are
added, so that boost::program_options can parse and format it.
after this change,
* we can print more user friendly error, like
```
/scylla perf-sstable --mode index-foo
error: the argument ('index-foo') for option '--mode' is invalid
Try --help.
```
instead of a bunch of output which is printed as if we passes the
correct option as the argument of the `--mode` option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required to execute this req,
2. global topo req is executed by topo coordinator, which reads data attached to the req.
The KS name is among the data attached to the req. There's a time window between these steps where a to-be-altered KS could have been DROPped, which results in topo coordinator forever trying to ALTER a non-existing KS. In order to avoid it, the code has been changed to first check if a to-be-altered KS exists, and if it's not the case, it doesn't perform any schema/tablets mutations, but just removes the global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected changes, which is due to the fact that the code is written badly and needs to be refactored - an effort that's already planned under #19126
(I suggest to disable displaying whitespace differences when reviewing this PR).
Fixes: scylladb/scylladb#19576Closesscylladb/scylladb#19666
* github.com:scylladb/scylladb:
tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
cql: refactor rf_change indentation
Prevent ALTERing non-existing KS with tablets
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.
The lister resembles the directory_lister from util -- it returns
entries upon its .get() invocation, and should be .close()d at the end.
Internally the lister issues ListObjectsV2 request with provided prefix
and limits the server with the amount of entries returned not to consume
too much local memory (we don't have streaming XML parser for response).
If the result is indeed truncated, the subsequent calls include the
continuation token as per [1]
[1] https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method logic is clean and simple -- load sstable from the descriptor and sort it into one of collections (local, shared, remote, unsorted). To achieve that there's a bunch of helper methods, but they duplicate functionality of each other. Squashing most of this code into process_descriptor() makes it easier to read and keeps sstable_directory private API much shorter.
Closesscylladb/scylladb#20126
* github.com:scylladb/scylladb:
sstable_directory: Open-code load_sstable() into process_descriptor()
sstable_directory: Squash sort_sstable() with process_descriptor()
sstable_directory: Remove unused sstable_filename(desc) helper
sstable_directory: Log sst->get_filename(), not sstable_filename(desc)
sstable_directory: Keep loaded sst in local var
sstable_directory: Remove unused helpers
sstable_directory: Load sstable once when sorting
There are two load_sstable() overloads, and one of them is only used
inside process_descriptor(). What this loading helper does is, in fact,
processes given descriptor, so it's worth having it open-coded into its
caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter (caller) loads sstable, so does the former, so load it once
and then put it in either list/set, depending on flags and shard info.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some places that log sstable Data file name via sstable
descriptor. After previous patching all those loggers have sstable at
hand and can use sstable::get_filename() instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to decide which list to put sstable into, the sort_sstable()
first calls get_shards_for_this_sstable() which loads the sstable
anyway. If loaded shards contain only the current one (which is the
common case) sstable is loaded again. In fact, if the sstable happens to
be remote it's loaded anyway to get its open info.
Fix that by loading sstable, then getting shards directly from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reverse parameter is no longer used with native reverse reads.
The row ranges are provided in native reverse order together with
a reversed schema, thus the reverse parameter remain false all the
time and can be droped.
When a reverse slice is provided, it is given in the native reverse
format. Thus the ranges will be returned in the same order as stored
in the slice.
Therefore there is no need to distinguish between get_ranges and
get_native_ranges. The latter one gets dropped and get_ranges returns
ranges in the same order as stored in the slice.
The reverse parameter is no longer used with native reverse reads.
A reversed schema is provided and thus the reverse parameter shall
remain false all the time.
Simplify implementation and for clustering key ranges in native
reversed format, require a reversed table schema.
Trimming native reversed clustering key ranges requires a reversed
schema to be passed in. Thus, the reverse flag is no longer required
as it would always be set to false.
The reconcilable_result is built as it would be constructed for
forward read queries for tables with reversed order.
Mutations constructed for reversed queries are consumed forward.
Drop overloaded reversed functions that reverse read_command and
reconcilable_result directly and keep only those requiring smart
pointers. They are not used any more.
Remove schema reversing in query() and query_mutations() methods.
Instead, a reversed schema shall be passed for reversed queries.
Rename a schema variable from s into query_schema for readability.
For reversed queries, query_result() method accepts a reversed table
schema and read_command with a query schema version and a slice in
native reversed format.
Support mixed-node clusters. In such a case, the feature flag
native_reverse_queries is disabled and the read_command in sent
to replicas in the old regacy format (stores table schema version
and a slice in the legacy reverse format).
After the reconciliation, for the read+repair case, un-reversed
mutations are sent to replicas, i.e. forward ones.
Use a reversed schema and a native reversed slice when constructing
a read_command and executing a reversed select statement.
Such a created read_command is passed further down to query_pagers::pager
and storage::proxy::query_result that transform it to the format
they accept/know, i.e. lagacy.
In handle_read, detect whether a coming read_command is in the
legacy reversed format or native reversed format.
The result will be used to transform the read_command between format
as well as to transforms the results before they are send back to
the coordinator.
The reconcilable_result is reversed by reversing mutations for all
paritions it holds. Reversing is asynchronous to avoid potential
stall.
Use for transitions between legacy and native formats and in order
to support mixed-nodes clusters.
The read_command is reversed by reversing the schema version it
holds and transforming a slice from the legacy reversed format to
the native reversed format.
Use for trasition between format and to support mixed-nodes clusters
Reverse reads have already been with us for a while, thus this back
door option to read entire paritions forward and reversing them after
can be retired.
When signing AWS query one need to prepare "query string" which is a
line looking like `encode(query_param)=encode(query_value)&...`. Encoded
are only the query parameter names and values. It was missing in current
code and so far worked because no encodable characters were used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Consider the following:
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.
It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.
Refs #19378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All lambdas passed to test_using_reusable_sst() conform to the prototype
void (test_env&, sstable_ptr)
All lambdas passed to test_using_reusable_sst_returning() conform to the
prototype
NON_VOID (test_env&, sstable_ptr)
The common parameter list of both prototypes can be expressed with the
concept
std::invocable<test_env&, sstable_ptr>
Once a "Func" template parameter (i.e., function type) satisfying this
concept is taken, then "Func"'s void or non-void return type can be
commonly expressed with
std::invoke_result_t<Func, test_env&, sstable_ptr>
In turn, test_env::do_with_async_returning<...> can be instantiated with
this return type, even if it happens to be "void".
([stmt.return] specifies, "[a] return statement with an operand of type
void shall be used only in a function that has a cv void return type",
meaning that
return func(env)
will do the right thing in the body of
test_env::do_with_async_returning<void>().)
Merge test_using_reusable_sst() and test_using_reusable_sst_returning()
into one. Preserve the function name from the former, and the
test_env::do_with_async_returning<...>() call from the latter.
Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#20090
there are chances that developer launch `pytest` right under
`test/nodetool`, in that case current working directory is not
the root directory of the project, so the path to suppression rules
does not point to a file.
to cater the needs to run the test under `test/nodetool`, let's
use the path deduced from the top_srcdir.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Unit testing for the SSTable validation API happens in
`sstable_validate_test`. Currently, this test checks the API against
some invalid SSTables with out-of-order clustering rows and out-of-order
partitions. However, both are types of content-level corruption that do
not trigger `malformed_sstable_exception` errors.
Extend the test to cover cases of file-level corruption as well, i.e.,
cases that would raise a `malformed_sstable_exception`. Construct an
SSTable with an invalid checksum to trigger this.
This is part of the effort to improve scrub to handle all kinds of
corruption.
Fixesscylladb/scylladb#19057
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#20096
`maybe_rehash` is complimentary and is not strictly require to succeed. If it fails, it will retry on the next call, but there's no reason to throw an exception that will fail its caller, since `maybe_rehash` is called as the final step after the caller has already succeeded with its action.
Minor enhancement for the error path, no backport required.
Closesscylladb/scylladb#19910
* github.com:scylladb/scylladb:
cell_locker: maybe_rehash: reindent
cell_locker: maybe_rehash: ignore allocation failures
Currently, each change to tablet metadata triggers a full metadata reload from disk. This is very wasteful, especially if the metadata change affects only a single row in the `system.tablets` table. This is the case when the tablet load balancer triggers a migration, this will affect a single row in the table, but today will trigger a full reload.
We expect tablet count to potentially grow to thousands and beyond and the overhead of this full reload can become significant.
This PR makes tablet metadata reload partial, instead of reloading all metadata on topology or schema changes, reload only the partitions that are affected by the change. Copy the rest from the in-memory state.
This is done with two passes: first the change mutations are scanned and a hint is produced. This hint is then passed down to the reload code, which will use it to only reload parts (rows/partitions) of the metadata that has actually changed.
The performance difference between full reload and partial reload is quite drastic:
```
INFO 2024-07-25 05:06:27,347 [shard 0:stat] testlog - Tablet metadata reload:
full 616.39ms
partial 0.18ms
```
This was measured with the modified (by this PR) `perf_tablets`, which creates 100 tables, each with 2K tablets. The test was modified to change a single tablet, then do a full and partial reload respectively, measuring the time it takes for reach.
Fixes: #15294
New feature, no backport needed.
Closesscylladb/scylladb#15541
* github.com:scylladb/scylladb:
test/perf/perf_tablets: add tablet metadata reload perf measurement
test/boost/tablets_test: add test for partial tablet metadata updates
db/schema_tables: pass tablet hint to update_tablet_metadata()
service/storage_service: load_tablet_metadata(): add hint parameter
service/migration_listener: update_tablet_metadata(): add hint parameter
service/raft/group0_state_machine: provide tablet change hint on topology change
service/storage_service: topology_state_load(): allow providing change hint
replica/tablets: add update_tablet_metadata()
replica/tablets: fix indentation
replica/tablets: extract tablet_metadata builder logic
replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
locator/tablets: add tablet_map::clear_tablet_transition_info()
locator/tablets: make tablet_metadata cheap to copy
mutation/canonical_mutation: add key()
Measure reload perf of full reload vs. partial reload, after changing a
single tablet.
While at it, modify the `--tablets-per-table` parameter, so that it has
a default parameter which works OOTB. The previous default was both too
large (causing oversized commitlog entry errors) and not a power of two.
Replace the has_tablet_mutations in `merge_tables_and_views()` with a
hint parameter, which is calculated in the caller, from the original
schema change mutations. This hint is then forwarded to the notifier's
`update_tablet_metadata()` so that subscribers can refresh only the
tablet partitions that changed.
The hint contains information related to what exactly changed, allowing
listeners to do partial updates, instead of reloading all metadata on
each notification.
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
on a single mutation, allowing for incremental hint extraction
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
Fixes#19960
Write path for sstables/commitlog need to handle the fact that IO extensions can
generate errors, some of which should be considered retry-able, and some that should,
similar to system IO errors, cause the node to go into isolate mode.
One option would of course be for extensions to simply generate std::system_errors,
with system_category and appropriate codes. But this is probably a bad idea, since
it makes it more muddy at which level an error happened, as well as limits the
expressibility of the error.
This adds three distinct types (sharing base) distinguishing permission, availabilty
and configuration errors. These are treated akin to EACCESS, ENOENT and EINVAL in
disk error handler and memtable write loop.
Tests updated to use and verify behaviour.
Closesscylladb/scylladb#19961
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.
Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.
The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.
The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
--------
seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::(anonymous namespace)::thread_wake_task
--------
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...
From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.
Fixes: #19873.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19987
If parent_info argument of compaction_manager::perform_compaction
is std::nullopt, then created compaction executor isn't tracked by task
manager. Currently, all compaction operations should by visible in task
manager.
Modify split methods to keep split executor in task manager. Get rid of
the option to bypass task manager.
Closesscylladb/scylladb#19995
* github.com:scylladb/scylladb:
compaction: replace optional<task_info> with task_info param
compaction: keep split executor in task manager
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.
Fix the aforementioned.
Closesscylladb/scylladb#19930
* github.com:scylladb/scylladb:
test: add test to check that task handler is fixed
tasks: fix task handler
Remove xfail from all tests for #5361, as the issue is fixed.
Remove xfail from test_group_by_clustering_prefix_with_limit
It references #5362, but is fixed by #17237.
Refs #17237
Currently LIMIT not passed to the query executor at all and it was just
an accident that it worked for the case referenced in #17237. This
change passes the limit value down the chain.
The get_limit() function performed tasks outside of its scope - for
example checked if the statement was an aggregate. This change moves the
onus of the check to the caller.
The comment in the code already states that we should use the
user-defined page size if it's provided. To avoid OOM conditions we'll
use the internally defined limit as the upper bound or if no page size
is provided.
This change lays ground work for fixing #5362 and is necessary to pass
the test introduced in #19392 once it is implemented.
This patch adds `suppress_features` error injection. It allows to revoke
support for some features and it can be used to simulate upgrade process
in test.py.
Features to suppress are passed as injection's value, separated by `;`.
Example: `PARALLELIZED_AGGREGATION;UDA_NATIVE_PARALLELIZED_AGGREGATION`
Fixesscylladb/scylladb#20034Closesscylladb/scylladb#20055
before this change, we use the default options for
performing read on the input. and the default options
is like
```c++
struct file_input_stream_options {
size_t buffer_size = 8192; ///< I/O buffer size
unsigned read_ahead = 0; ///< Maximum number of extra read-ahead operations
};
```
which is not able to offer good throughput when
reading from disk, when we stream to S3.
so, in this change, we use options which allows better throughput.
Refs 061def001d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20074
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.
Fixesscylladb/scylladb#18654Closesscylladb/scylladb#19170
This patch makes all cql connections update theirs service level parameters automatically when:
- any service level is created or changed
- one role is granted to another
- any service level is attached to/detached from a role
First of all, the patch defines what a service level and an effective service level are 938aa10509. No new type of service levels are introduced, the commit only clarifies definitions and names what an effective service level is.
(Effective service level is created by merging all service levels which are attached to all roles granted to the user. It represents exact values of connection's parameters.)
Previously, to find an effective service level of a user, it required O(n) internal queries: O(n) queries to recursively find all granted roles (`standard_role_manager::query_granted()`) and a query for each role to get its service level (`standard_role_manager::get_attribute()`, which sums to O(n) queries).
Because we want to reload SL parameters for all opened cql connections, we don't want to do O(n) queries for every connection, every time we create or change any service level/grant one role to another/attach or detach a service level to/from a role.
To speed it up, the patch adds another layer of service level controller cache, which stored `role_name -> effective_service_level` mapping. This way finding a effective service level for a role is only a lookup to a map.
Building the new cache requires only 2 queries: one to obtain all role hierarchy one to get all roles' service level.
Fixesscylladb/scylladb#12923Closesscylladb/scylladb#19085
* github.com:scylladb/scylladb:
test/auth_cluster/test_raft_service_levels: add test for automatic connection update
api/cql_server_test: add CQL server testing API
transport/cql_server: subscribe to sl effective cache reloaded
transport/controller: coroutinize `subscribe_server` and `unsubscribe_server`
transport/cql_server: add method to update service level params on all connections
generic_server: use async function in `for_each_gently()`
service/qos/sl_controller: use effective service levels cache
service/qos/service_level_controller: notify subscribers on effective cache reloaded
service/raft/group0_state_machine: update effective service levels cache
service/topology_coordinator: migrate service levels before auth
service/qos/service_level_controller: effective service levels cache
utils/sorting: allow to pass any container as verticies
service/qos/service_level_controller: replace shard check to assert
service/qos: define effective service level
service/qos/qos_common: use const reference in `init_effective_names()`
service/qos/service_level_controller: remove unused field
auth: return map of directly granted roles
test/auth/test_auth_v2_migration: create sl1 in the test
We have two mechanism to give visibility into reads having to process many tombstones:
* a warning in the logs, triggered if a read processed more the `tombstone_warn_threshold` dead rows/tombstones
* a trace message, which includes stats of the amount of rows in the page, including the amount of live and dead rows as well as tombstones
This series extends this to also include information on cells, so we have visibility into the case where a read has to process an excessive amount of cell tombstones (mainly because of collections).
A log line is now also logged if the amount of dead cells/tombstones in the page exceeds `tombstone_warn_threshold`. The trace message is also extended to contain cell stats.
The `tombstone_warn_threshold` log lines now receive a 10s rate-limit to avoid excessive log spamming. The rate-limit is separate for the row and cell logs.
Example of the new log line (`tombstone_warn_threshold=10` ):
```
WARN 2024-05-30 07:56:44,979 [shard 0:stmt] querier - Read 98 live cells and 126 dead cells/tombstones for system_schema.scylla_tables <partition-range-scan> (-inf, +inf) (see tombstone_warn_threshold)
```
Example of the new tracing message:
```
Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 13 cell(s) (1 live, 12 dead) [shard 0] | 2024-05-30 08:13:19.690803 | 127.0.0.1 | 6114 | 127.0.0.1
```
Fixes: https://github.com/scylladb/scylladb/issues/18996
Improvement, not a backport candidate.
Closesscylladb/scylladb#18997
* github.com:scylladb/scylladb:
test/boost: mutation_test: add test for cell compaction stats
mutation/compact_and_expire_result: drop operator bool()
querier: consume_page(): add rate-limiting to tombstone warnings
querier: consume_page(): add cell stats to page stats trace message
querier: consume_page(): add tombstone warning for cell tombstones
querier: consume_page(): extract code which logs tombstone warning
mutation/mutation_compactor: collect and aggregate cell compaction stats
mutation: row::compact_and_expire(): use compact_and_expire_result
collection_mutation: compact_and_expire(): use compact_and_expire_result
mutation: introduce compact_and_expire_result
Fixes#13334
All required code paths (see enterprise) now uses
extensions::is_extension_internal_keyspace.
The old mechanism can be removed. One less global var.
Closesscylladb/scylladb#20047
This is a followup to #19937, for #19803. See in particular [this comment](https://github.com/scylladb/scylladb/issues/19803#issuecomment-2258371923).
The primary conversion target is coroutines. However, while coroutines are the most convenient style, they are only infrequently usable in this case, for the following reasons:
- Wherever we have a `future::finally()` that calls a cleanup function that returns a future (which must be awaited), we cannot use `co_await`. We can only use `seastar::async()` with `deferred_close` or `defer()`.
- The code passes lots of lambdas, and `co_await` cannot be used in lambdas. First, I tried, and the compiler rejects it; second, a capturing lambda that is a coroutine is a trap [[1]](https://devblogs.microsoft.com/oldnewthing/20211103-00/?p=105870) [[2]](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rcoro-capture).
In most cases, I didn't have to use naked `seastar::async()`; there were specialized wrappers in place already. Thus, most of the changes target `seastar::thread` context under existent `seastar::async()` wrappers, and only a few functions end up as coroutines.
The last patch in the series (`test/sstable: remove useless variable from promoted_index_read()`) is an independent micro-cleanup, the opportunity for which I thought to have noticed while reading the code.
The tail of `test/boost/sstable_test.cc` (the stuff following `promoted_index_read()`) is already written as `seastar::thread`. That's already better (for readability) than future chaining; but could have I perhaps further converted those functions to coroutines? My answer was "no":
- Some of the candidate functions relied on deferred cleanups that might need to yield (all three variants of `count_rows()`).
- Some had been implemented by passing lambdas to wrappers of `seastar::async()` (`sub_partition_read()`, `sub_partitions_read()`).
- The test case `test_skipping_in_compressed_stream()` initially looked promising for co-routinization (from its starting point `seastar::async()`), because it seemed to employ no deferred cleanup (that might need to yield). However, the function uses three lambdas that must be able to yield internally, and one of those (`make_is()`) is even capturing.
- The rest (`test_empty_key_view_comparison()`, `test_parse_path_good()`, `test_parse_path_bad()`) was synchronous code to begin with.
```
test/boost/sstable_test.cc | 188 +++++++++-----------
1 file changed, 83 insertions(+), 105 deletions(-)
```
Refactoring; no backport needed.
Closesscylladb/scylladb#20011
* github.com:scylladb/scylladb:
test/sstable: remove useless variable from promoted_index_read()
test/sstable: rewrite promoted_index_read() with async()
test/sstable: unfuturize lambda invocation in test_using_reusable_sst*()
test/sstable: rewrite wrong_range() with async()
test/sstable: simplify not_find_key_composite_bucket0() under test_using_reusable_sst()
test/sstable: rewrite full_index_search() with async()
test/sstable: simplify find_key*(), all_in_place() under test_using_reusable_sst()
test/sstable: rewrite (un)compressed_random_access_read() with async()
test/sstable: simplify write_and_validate_sst()
test/sstable: simplify check_toc_func() under async()
test/sstable: simplify check_statistics_func() under async()
test/sstable: simplify check_summary_func() under async()
test/sstable: coroutinize check_component_integrity()
test/sstable: rewrite write_sst_info() with async()
test/sstable: simplify missing_summary_first_last_sane()
test/sstable: coroutinize summary_query_fail()
test/sstable: rewrite summary_query() with async()
test/sstable: coroutinize (simple/composite)_index_read()
test/sstable: rewrite index_read() with async()
test/sstable: rewrite test_using_reusable_sst() with async()
test/sstable: rewrite test_using_working_sst() with async()
Add a CQL server testing API with and endpoint to dump
service level parameters of all CQL connections.
This endpoint will be later used to test functionality of
automated updating CQL connections parameters.
Make cql server (but not maintenance server) is subscribed to qos
configuration change.
Trigger update of connections' service level params on effective cache
reloaded event.
It's not done on maintenance server because it doesn't support role
hierarchy nor attaching service levels.
connections
Trigger update of service level param on every cql connection.
In enterprise, the method needs also to update connections' scheduling
group.
In the following patch, we will add a method to update service levels
parameters for each cql connections.
To support this, this patch allows to pass async function as a parameter
to `for_each_gently()` method.
Updates to `system.role_members` and `system.role_attributes` affect
effective service levels cache, so applying mutations to those tables
should reload the effective SL cache.
Effective service level cache will be updated when mutations are applied to
some of the auth tables.
But the effective cache depends on first-level service levels cache, so
service levels data should be migrated before auth data.
Add a second layer of service_level_controller cache which contains
role name -> effective service level mapping.
To build the mapping, controller uses first cache layer (service level
name -> service level) and 2 queries to auth tables (one to `roles` and
one to `role_members`).
The container containing all verticies doesn't have to be a vector.
Allowing to pass any container that meet conditions, will make to
function more flexible.
Write down definitions of `service level` and `effective service level`
in service/qos/service_level_controller.hh.
Until now, effective service level was only used as result of
`LIST EFFECTIVE SERVICE LEVEL OF <role>`.
Now we want to have quick access to effective service level of
each role and introduce cache of effective sl to do it.
New definitions clarify things.
The commit also renames:
- `update_service_levels_from_distributed_data` -> `update_service_levels_cache`
Later we will introduce effective_service_level_cache, so this change
standarizes the names.
- `find_service_level` -> `find_effective_service_level`
The function actualy returns effective service level.
Returns multimap of directly granted roles for each role. Uses
only one query to create the map, instead of doing recursive queries
for each individual role.
Test `test_auth_v2_migration` creates auth data where role `users`
has assigned service level `sl:fefe` but the service level isn't
actually created.
In following patches, we are going to introduce effective service levels
cache which depends on auth and is refreshed when mutations are applied
to v2 auth tables.
Without this changes, this test will fail because the service level
doesn't exist.
Also the name `sl:fefe` is change to `sl1`.
It runs in the background and consists of two parts -- async() lambda and following .then()-s. This PR move the background running code into its own method and coroutinizes it in parts. With #19954 merged it finally looks really nice.
Closesscylladb/scylladb#20058
* github.com:scylladb/scylladb:
view_builder: Restore indentation after previous patches
view_builder: Coroutinize inner start_in_background() calls
view_builder: Coroutinize outer start_in_background() calls
view_builder: Add helper method for background start
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.
Turn exceptions from aborts into WARN level.
Also improve logging by printing the command that failed.
Fixesscylladb/scylladb#19754Closesscylladb/scylladb#19935
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.
This also violates invariants about replica placement and this state
cannot be fixed by topology operations.
One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:
load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at: ...
Fixes#20032Closesscylladb/scylladb#20053
Sync points are created, via POST HTTP requests, for a subset of nodes
in the cluster. Those nodes are specified in a request's parameter
`target_hosts`. When the parameter is empty, Scylla should assume
the user wants to create a sync point for ALL nodes.
Before these changes, sync points were created only for LIVE nodes.
If a node was dead but still part of the cluster and the user
requested creating a sync point leaving the parameter `target_hosts`
empty, the dead node was skipped during the creation of the sync point.
That was inconsistent with the guarantees the sync point API provides.
In this commit, we fix that issue and add a test verifying that
the changes have made the implementation compliant with the design
of the sync point API -- the test only passes after this commit.
Fixesscylladb/scylladb#9413Closesscylladb/scylladb#19750
One of the co_await-ed parts of this method is async() lambda. It can be
coroutinized too. One thing to care is the semaphore units -- its scope
should (?) terminate earlier than the whole start_in_background() so
release it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method consists of two parts -- one running in async() thread and
continuations to it. This patch turns the latter chain into co_await-s.
The mentioned chain is "guarded" by then_wrapped() catch of any
exception, which is turned into a plain try-catch block.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view_builder::start() happens in the background. It's good to have
explicit start_in_background() method and coroutinize it next.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In this commit, we describe the mechanism of sync point
in Hinted Handoff in the user documentation. We explain
the motivation for it and how to use it, as well as list
and describe all of the parameters involved in the process.
Errors that may appear and experienced by the user
are addressed in the article.
Fixesscylladb/scylladb#18500Closesscylladb/scylladb#19686
The get_all_data_file_locations and get_saved_caches_location get the
returned data from db::config and should be next other endpoints working
with config data.
refs: #2737
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19958
When starting, view builder wants all shards to synchronize with each other in the middle of initialization. For that they all synchronize via shard-0's instance counter and a shared future. There's cross-shard barrier in utils/ that provides the same facility.
Closesscylladb/scylladb#19954
* github.com:scylladb/scylladb:
view_builder: Drop unused members
view_builder: Use cross-shard barrier on start
view_builder: Add cross-shard barrier to its .start() method
Having an operator bool() on this struct is counter-intuitive, so this
commit drops it and migrates any remaining users to bool is_live().
The purpose of this operator bool() was to help in incrementally replace
the previous bool return type with compact_and_expire_result in the
compact_and_expire() call stack. Now that this is done, it has served
its purpose.
Soon, we want to log a warning on too many cell tombstones as well.
Extract the logging code to allow reuse between the row and cell
tombstone warnings.
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but
returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.
Fix the aforementioned.
This commit extracts the information about the configuration the user should do
right after installation (especially running scylla_setup) to a separate file.
The file is included in the relevant pages, i.e., installing with packages
and installing with Web Installer.
In addition, the examples on the Web Installer page are updated
with supported versions of ScyllaDB.
Fixes https://github.com/scylladb/scylladb/issues/19908Closesscylladb/scylladb#20035
Add more logging for raft-based topology operations in INFO and DEBUG
levels.
Improve the existing logging, adding more details.
Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).
Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.
These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).
Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945Closesscylladb/scylladb#19998
This PR adds the 6.0-to-6.1 upgrade guide (including metrics) and removes the 5.4-to-6.0 upgrade guide.
Compared 5.4-to-6.0, the the 6.0-to-6.1 guide:
- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
so now there's no scenario that would require the user to follow the validation procedure.
- Removed the references to the Enable Consistent Topology Updates page (which was in version 6.0 and is removed with this PR) across the docs.
See the individual commits for more details.
Fixes https://github.com/scylladb/scylladb/issues/19853
Fixes https://github.com/scylladb/scylladb/issues/19933
This PR must be backported to branch-6.1 as it is critical in version 6.1.
Closesscylladb/scylladb#19983
* github.com:scylladb/scylladb:
doc: remove the 5.4-to-6.0 upgrade guide
doc: add the 6.0-to-6.1 upgrade guide
Currently, the resource utilization in CI is low. Increasing the number of clusters will increase how many tests are executed simultaneously. This will decrease the time it takes to execute and improve resource utilization.
Related: https://github.com/scylladb/qa-tasks/issues/1667Closesscylladb/scylladb#19832
The command has a singl check for the missing keyspace and/or table
parameters and if the check fails, there is a combined error message.
Apparently this is confusing, so split the check so that missing
keyspace and missing table args have its own check and error message.
Fixes: scylladb/scylladb#19984Closesscylladb/scylladb#20005
This commit removes the 5.4-to-6.0 upgrade guide and all references to it.
It mainly removes references to the Enable Consistent Topology Updates page,
which was added as enabling the feature was optional.
In rare cases, when a reference to that page is necessary,
the internal link is replaced with an external link to version 6.0.
Especially the Handling Cluster Membership Change Failures page was modified
for troubleshooting purposes rather than removed.
instead of reinventing the wheel, let's use the existing one.
in this change, we trade the `div_ceil()` implementated in s3/client.cc
for the existing one in utils/div_ceil.hh . because we are not using
`std::lldiv()` anymore, the corresponding `#include <cstdlib>` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20000
instead of using `std::rethrow_exception()`, use
`coroutine::return_exception_ptr()` which is a little bit more
efficient.
See also 6cafd83e1c
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20001
fmt 11 enforces the constness of `format()` member function, if it
is not marked with `const`, the tree fails to build with fmt 11, like:
```
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
1393 | ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
| ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<service::migration_badness, fmt::formatter<service::migration_badness>>' requested here
1374 | custom.format = format_custom_arg<
| ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:170:14: note: in instantiation of function template specialization 'fmt::format_to<fmt::basic_appender<char>, const locator::global_tablet_id &, const locator::tablet_replica &, const locator::tablet_replica &, const service::migration_badness &, 0>' requested here
170 | fmt::format_to(ctx.out(), "{{tablet: {}, {} -> {}, badness: {}", candidate.tablet, candidate.src,
| ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:161:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<service::migration_badness>', but method is not marked const
161 | auto format(const service::migration_badness& badness, FormatContext& ctx) {
| ^
```
so, in this change, we mark these two `format()` member functions const.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20013
rewrite the function as coroutine to make it easier to read and maintain, following lifetime issues we had and fixed in this function.
The second commit adds a test that drops a table while there is a counter update operation ongoing in the table.
The test reproduces issue https://github.com/scylladb/scylla-enterprise/issues/4475 and verifies it is fixed.
Follow-up to https://github.com/scylladb/scylladb/pull/19948
Doesn't require backport because the fix to the issue was already done and backported. This is just cleanup and a test.
Closesscylladb/scylladb#19982
* github.com:scylladb/scylladb:
db: test counter update while table is dropped
db: coroutinize do_apply_counter_update
Recently, some users have seen "Key size too large" errors in various
places. Cassandra and Scylla impose a 64KB length limit on keys, and
we have known about bugs in this area for a long time - and even had
some translated Cassandra unit tests that cover some of them. But these
tests did not cover all the corner cases and left us with partial and
fragmented knowledge of this problem, spread over many test files and
many issues.
In this patch, we add a single test file, test/cql-pytest/test_key_length.py
which attempts to rigourously explore the various bugs we have with
CQL key length limits. These test aim to reproduce all known bugs in
this area:
* Refs #3017 - CQL layer accepts set values too large to be written to
an sstable
* Refs #10366 - Enforce Key-length limits during SELECT
* Refs #12247 - Better error reporting for oversized keys during INSERT
* Refs #16772 - Key length should be limited to exactly 65535, not less
The following less interesting bug is already covered by many tests so
I decided not to test it again:
* Refs #7745 - Length of map keys and set items are incorrectly limited
to 64K in unprepared CQL
There's also a situation in materialized views and secondary indexes,
where a column that was _not_ a key, now becomes a key, and a length
limit needs to be enforced on it. We already have good test coverage
for this (in test/cql-pytest/test_secondary_index.py and in
test/cql-pytest/test_materialized_view.py), and we have an issue:
* Refs #8627 - Cleanly reject updates with indexed values where value > 64k
All 16 tests added here pass on Cassandra 5 except one that fails on
https://issues.apache.org/jira/browse/CASSANDRA-19270, but 11 of the
tests currently fail on Scylla (6 on #12247, 2 on #10366, 3 on #16772).
It is possible that our decision in #16772 will not be to fix Scylla
to match Cassandra but rather to declare that strict compatibility isn't
needed in this case or even that Cassandra is wrong. But even then,
having these tests which demonstrate the behavior of both Cassandra
and Scylla will be important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16779
When debugging coredumps some (small, but useful) information is hidden
in the notes of the core ELF file. Add some words about it exists, what
it includes and the thing that is always forgotten -- the way to get one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19962
Commit ad0e6b79 (replica: Remove all_datadir from keyspace config) removed all_datadirs from keyspace config, now it's datadir turn. After this change keyspace no longer references any on-disk directories, only the sstables's storage driver attached to keyspace's tables does.
refs #12707Closesscylladb/scylladb#19866
* github.com:scylladb/scylladb:
replica: Remove keyspace::config::datadir
sstables/storage: Evaluate path for keyspace directory in storage
sstables/storage: Add sstables_manager arg to init_keyspace_storage()
This commit adds OS support for version 6.1 and removes OS support for 5.4
(according to our support policy for versions).
Closesscylladb/scylladb#19992
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
The for_tests constructor has a metrics parameter defaulted to
register_metrics::no, but when delegating to the other constructor, a
hard-coded register_metrics::no is passed. This makes no difference
currently, because all callers use the default and the hard-coded value
corresponds to it. Let's fix it nevertheless to avoid any future
surprises.
Closesscylladb/scylladb#20007
The large_partition_schema() call returns a copy of the "schema_ptr"
object that points to an effectively statically initialized thread_local
"schema" object. The large_partition_schema() call has no bearing on
whether, or when, the "schema" object is constructed, and has no side
effects (other than copying an "lw_shared_ptr" object). Furthermore, the
return value of large_partition_schema() is not used for anything in
promoted_index_read().
This redundant call seems to date back to original commit 3dd079fb7a
("tests: add test for reading parts of a large partition", 2016-08-07).
Remove the call and the variable.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
All lambdas passed to test_using_reusable_sst() and
test_using_reusable_sst_returning() have been converted to future::get()
calls (according to the seastar::thread context that they are now executed
in). None of the lambdas return futures anymore; they all directly return
void or non-void. Therefore, drop futurize_invoke(...).get() around the
lambda invocations in test_using_reusable_sst*().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
For better readability, replace the future::then() chaining (and the
associated manual fiddling with object lifecycles) with future::get() (and
rely on seastar::thread's stack). We're already in seastar::thread
context.
Similarly, replace the future::finally() underlying with_closeable() with
deferred_close(); with the assumption that mutation_reader::close() never
fails (and is therefore safe to call in the "deferred_close" destructor).
This is actually guaranteed, as mutation_reader::close() is marked
"noexcept".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
According to early patch "test/sstable: rewrite test_using_reusable_sst()
with async" in this series, lambdas passed to test_using_reusable_sst()
are invoked:
(a) less importantly here, in seastar::thread context,
(b) more importantly here, futurized (temporarily so).
The test case not_find_key_composite_bucket0() doesn't chain futures;
therefore it needs no conversion to future::get() for purpose (a);
however, we can eliminate its empty future return. Fact (b) will cover for
that, until all such lambdas are converted to direct "void" returns (at
which point we can remove the futurization from
test_using_reusable_sst()).
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
For better readability, replace future::then() chaining with
future::get(). (We're already in seastar::thread context.)
This patch is best viewed with "git show -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
According to early patch "test/sstable: rewrite test_using_reusable_sst()
with async" in this series, lambdas passed to test_using_reusable_sst()
are invoked:
(a) less importantly here, in seastar::thread context,
(b) more importantly here, futurized (temporarily so).
The test cases find_key_map(), find_key_set(), find_key_list(),
find_key_composite(), all_in_place() don't chain futures; therefore they
need no conversion to future::get() for purpose (a); however, we can
eliminate their empty future returns. Fact (b) will cover for that, until
all such lambdas are converted to direct "void" returns (at which point we
can remove the futurization from test_using_reusable_sst()).
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
All three lambdas passed to write_and_validate_sst() now use future::get()
rather than future::then() chaining; in other words, the future::get()
calls inside all these seastar::thread contexts have been pushed down to
the lambdas. Change all these lambdas' return types from future<> to void.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.
We're going to eliminate the trailing "return make_ready_future<>()"
later.
This patch is best viewed with "git show -W -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.
We're going to eliminate the trailing "return make_ready_future<>()"
later.
This patch is best viewed with "git show -W -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.
We're going to eliminate the trailing "return make_ready_future<>()"
later.
This patch is best viewed with "git show -W -b".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
check_component_integrity() does not rely on any deferred close or stop
operations; turn it into a coroutine therefore, for best readability.
This conversion demonstrates particularly well how much the stack eases
coding. We no longer need to artificially extend the lifetime of "tmp"
with a final
.then([tmp] {})
future. Consequently, "tmp" no longer needs to be a shared pointer to an
on-heap "tmpdir" object; "tmp" can just be a "tmpdir" object on the stack.
While at it, eliminate the single-use local objects "s" and "gen", for
movability's sake. (We could use std::move() on these variables, but it
seems easier to just flatten the function calls that produce the
corresponding rvalues into the write_sst_info() argument list.)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The lambda passed to test_using_reusable_sst() is now invoked --
futurized, transitorily -- in seastar::thread context; stop returning an
explicit make_ready_future<>() from the lambda.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
summary_query_fail() does not rely on any deferred close or stop
operations; turn it into a coroutine therefore, for best readability.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
simple_index_read() and composite_index_read() do not rely on any deferred
close or stop operations; turn them into coroutines therefore, for best
readability.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Improve the readability of test_using_reusable_sst() by replacing
future::then() chaining with test_env::do_with_async() and future::get().
Unlike seastar::async(), test_env::do_with_async() restricts its input
lambda to returning "void". Because of this, introduce the variant
test_using_reusable_sst_returning(), based on
test_env::do_with_async_returning(), for lambdas returning non-void. Put
the latter to use in index_read() at once.
Subsequently, we'll gradually convert the lambdas passed to
test_using_reusable_sst() and test_using_reusable_sst_returning() from
returning futures to returning direct values. In order for
test_using_reusable_sst() and test_using_reusable_sst_returning() to cope
with both types of lambdas, wrap the lambdas into futurize_invoke().get().
In the seastar::thread context, future::get() will gracefully block on
genuine futures, and return immediately on direct values that were
futurized on the spot.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Make test_using_working_sst() easier to read by:
(1) replacing test_env::do_with() with seastar::async(),
seastar::defer(), and future::get();
(2) replacing seastar::async() and seastar::defer() with
test_env::do_with_async().
Technically speaking, this change does not perfectly preserve exceptional
behavior. Namely, test_env::do_with() uses future::finally() to link
test_env::stop() to the chain of futures, and future::finally() permits
test_env::stop() itself to throw an exception -- potentially leading to a
seastar::nested_exception being thrown, which would carry both the
original exception and the one thrown by test_env::stop().
Contrarily, the test_env::stop() deferred with seastar::defer() runs in a
destructor, and therefore test_env::stop() had better not throw there.
However, we will assume that test_env::stop() does not throw, albeit not
marked "noexcept". Prior commits 8d704f2532 ("sstable_test_env:
Coroutinize and move to .cc test_env::stop()", 2023-10-31) and
2c78b46c78 ("sstables::test_env: Carry compaction manager on board",
2023-10-31) show that we've considered individual actions in
test_env::stop() not to throw before.
The 128KB stack of seastar::thread (which underlies seastar::async())
should be a tolerable cost in a test case, in exchange for the improved
readability.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.
Use find_column_family at each table occurrence in insert_repair_meta
instead.
Closesscylladb/scylladb#19953
in fabab2f4, we introduced preemption_source, and added
`SCYLLA_ENABLE_PREEMPTION_SOURCE` preprocessor macro to enable
opt-in the pluggable preemption check.
but CMake building system was not updated accordingly.
so, in this change, let's sync the CMake building system with
`configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19951
This reverts commit c3bea539b6.
Since it breaking offline-installer artifact-tests. Also, it seems that we should have merged it in the first place since we don't need scylla-housekeeping checks for offline-installer
Closesscylladb/scylladb#19976
compaction_manager::perform_compaction does not create task manager
task for compaction if parent_info is set to std::nullopt. Currently,
we always want to create task manager task for compaction.
Remove optional from task info parameters which start compaction.
Track all compactions with task manager.
If perform_compaction gets std::nullopt as a parent info then
the executor won't be tracked by task manager.
Modify storage_group::split call so that it passes empty task_info
instead of nullopt to track split.
With the recently added mv admission control, we can now
test how are the view update backlogs updated and propagated
without relying just on the response delays that it was causing
until now.
This patch adds a test for it, replicating issues scylladb/scylladb#18461
and scylladb/scylladb#18783.
In the test, we start with an empty view update backlog, then perform
a write to it, increasing its backlog and saving the updated backlog
on coordinator, the backlog then drops back to 0, we wait 1s for the
backlog to be gossiped and we perform another write which should
succeed.
Due to scylladb/scylladb#18461, the test would fail because
in both gossip rounds before and after the write, the backlog was empty,
causing the write to be blocked by admission control indefinitely.
Due to scylladb/scylladb#18783, the test would fail because when
the backlog drops back to 0 after the write, the change is never
registered, causing all writes to be blocked as well.
In this patch we add 2 tests for checking that the mv admission control works.
The first one simply checks whether, after increasing the backlog on one node
over the admission control threshold, the following request is rejected with
the error message corresponding to the admission control.
The second one checks whether, after triggering admission control, the entire
user request fails instead of just failing a replica write. This is done by
performing a number of writes, some of which trigger the admission control
and cause retries, then checking if the node that had a large view update backlog
received all the writes. Before, the writes would succeed on enough replicas,
reaching QUORUM, and allowing the user write to succeed and cause no retries,
even though on the replica with a high backlog the write got rejected due to
the backlog size.
To avoid an expensive stack unwind, instead of throwing an error,
we can just return it thanks to the boost::result type that the
affected methods use. The result with an exception needs to be
constructed not implicitly, but with boost::outcome_v2::failure,
because the exception, converted into coordinator_exception_container
can be then converted into both into a successful response_id_type
as well as into a failure.
Currently, when a replica's view update backlog is full, the write is still
sent by the coordinator to all replicas. Because of the backlog, the write
fails on the replica, causing inconsistency that needs to be fixed by repair.
To avoid these inconsistencies, this patch adds a check on the coordinator
for overloaded replicas. As a result, a write may be rejected before being
sent to any replicas and later retried by the user, when the replica is no
longer overloaded.
Fixesscylladb/scylladb#17426
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.
Fixesscylladb/scylladb#8199Closesscylladb/scylladb#19338
* github.com:scylladb/scylladb:
mv: skip reading rows when generating partition tombstone update
mv: delete a partition in a single operation when applicable
cql-pytest: move ScyllaMetrics to util file to allow reuse
Add a test that drops a table while there is a counter update operation
ongoing in the table.
The test reproduces issue scylladb/scylla-enterprise#4475 and verifies
it is fixed.
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.
The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.
Fixes https://github.com/scylladb/scylladb/issues/16824Closesscylladb/scylladb#19779
* github.com:scylladb/scylladb:
test: perf: tablet_load_balancing: Test with higher shard and tablet counts
tablets: load_balancer: Avoid quadratic complexity when finding best candidate
tablets: load_balancer: Maintain load sketch properly during intra-node migration
tablets: load_balancer: Use "drained" flag
test: perf: tablet_load_balancing: Report load balancer stats
tablets: load_balancer: Move load_balancer_stats_manager to header file
tablets: load_balancer: Split evaluate_candidate() into src and dst part
tablets: load_balancer: Optimize evaluate_candidate()
tablets: load_balancer: Add more statistics
tablets: load_balancer: Track load per table on cluster level
tablets: load_balancer: Track load per table on node level
tablets: load_balancer: Use a single load sketch for tracking all nodes
locator: load_sketch: Introduce populate_dc()
tablets: load_balancer: Modify target load sketch only when emitting migration
locator: load_sketch: Introduce get_most_loaded_shard()
locator: load_sketch: Introduce get_least_loaded_shard()
locator: load_sketch: Optimize pick()/unload()
locator: load_sketch: Introduce load_type
test: perf: tablet_load_balancing: Report total tablet counts
test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
test: perf: tablet_load_balancing: Report time it took to schedule migrations
tablets: load_balancer: Log table load stats after each migration
tablets: load_balancer: Log per-shard load distribution in debug level
tablets: load_balancer: Improve per-table balance
tablets: load_balancer: Extract check_convergence()
tablets: load_balancer: Extract nodes_by_load_cmp
tablets: load_balancer: Maintain tablet count per table
tablets: load_balancer: Reuse src_node_info
test: perf: tablet_load_balancing: Print warnings about bad overcommit
test: perf: tablet_load_balancing: Allow running a single simulation
test: perf: tablet_load_balancing: Report best possible shard overcommit
test: perf: tablet_load_balancing: Report global shard overcommit
This commit adds the 6.0-to-6.1 upgrade guide.
Compared to the previous upgrade guide:
- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
so now there's no scenario that would require the user to follow the validation procedure.
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
to execute this req,
2. global topo req is executed by topo coordinator, which reads data
attached to the req.
The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126Fixes: #19576
It's only needed to start hints via proxy, but proxy can do it without gossiper argument
Closesscylladb/scylladb#19894
* github.com:scylladb/scylladb:
storage_service: Remote gossiper argument from join_cluster()
proxy: Use remote gossiper to start hints resource manager
hints: Const-ify gossiper references and anchor pointers
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.
Fixesscylladb/scylla-enterprise#4475Closesscylladb/scylladb#19948
A user complained that ScyllaDB is incompatible with Cassandra when it
requires ALLOW FILTERING on a restriction like WHERE x=1 AND y=1 where
x and y are two columns with secondary indexes.
In the tests added in this patch we show that:
1. Scylla *is* compatible with Cassandra when the traditional "CREATE
INDEX" is used - ALLOW FILTERING *is* required in this case in both
Cassandra and Scylla.
2. If SAI is used in Cassandra (CREATE CUSTOM INDEX USING 'SAI'),
indeed ALLOW FILTERING becomes optional. I believe this is incorrect
so I opened CASSANDRA-19795.
These two tests combined show that we're not incompatible with Cassandra,
rather Cassandra's two index implementations are incompatible between
themselves, and Scylla is in fact compatible in this case with Cassadra's
traditional index and not with SAI.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19909
If the source and destination shards picked for migration based on
global tablet balance do not have a good candidate in terms of effect
on per-table balance, the algorithm explores other source shards and
destinations. This has quadratic complexity in terms of shard count in
the worst case, when there are no good candidates.
Since we can have up to ~200 shards, this can slow down scheduling
significantly. I saw total scheduling time of 5 min in the following run:
scylla perf-load-balancing -c1 -m1G --iterations=8 \
--nodes=4 --tablets1=1024 --tablets2=8096 \
--rf1=2 --rf2=3 --shards=256
To improve, change the apprach to first find the best source shard and
then best target shard, sequentially. So it's now linear in terms of
shard count.
After the change, the total scheduling time in that run is down to 4s.
Minimizing source and destination metrics piece-wise minimizes the
combined metric, so badness of the best candidate doesn't suffer after
this change.
Affects only intra-node migration. The code was recording destination
shard as taken and did not un-take it in case we skipped the migration
due to lack of candidates.
Noticed during code review. Impact is minor, since even if this leads
to suboptimal balance, the next scheduling round should fix it.
Also, the source shard was not unloaded, but that should have no
impact on decisions. But to be future-proof, better to maintain the
load accurately in case the algorithm is extended with more steps.
Some of the calls inside the `raft_group0_client::start_operation()` method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.
This was in particular the case of operations that try to take the ownership of the raft group semaphore (`get_units(semaphore)`) - these waits should be cancelled when the abort source is triggered.
This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixesscylladb/scylladb#19223Closesscylladb/scylladb#19860
* github.com:scylladb/scylladb:
raft: fix the shutdown phase being stuck
raft: use the abort source reference in raft group0 client interface
There's a counter and a shared future on board, that used to facilitate
start-time barrier synchronization. Now they are not needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When starting, view builder spawns an async background fibers, and upon
its completion each shard needs to wait for other shards to do the same.
This is exactly what cross-shard barrier is about, so instead of
synchronizing via v.b.'s shard-0 instance, use the barrier. This makes
the view_builder::start() shorder and earier to read.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The barrier will be used by next patch to synchronize shards with each
other. When passed to invoke_on_all() lambda like this, each lambda gets
its its copy of the barrier "handler" that maintains shared state across
shards.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
They are executed frequently during tablet scheduling. Currently, they
have time complexity of O(N*log(N)) in terms of shard count. With
large shard counts, that has significant overhead.
This patch optimizes them down to O(log(N)).
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.
In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.
This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance. Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.
We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.
The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.
For example, for the following parameters:
params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}
The results are:
After:
Overcommit : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
Overcommit : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Before:
Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}
So shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.
Also, node overcommit for table1 was reduced from 1.81 to 1.02.
The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.
The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).
One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.
Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.
Fixes#16824
Will be reused when evaluating different targets for migration in later
stages.
The refactoring drops updating of _stats.for_dc(dc).stop_no_candidates
and we update _stats.for_dc(dc).stop_load_inversion in both cases
where convergence check may fail. The reason is that stat updates must
be outside check_convergence(), since the new use case should not
update those stats (it doesn't stop balancing, just drops
candidates). Propagating the information for distinguishing the two
cases would be a burden. But it's not necessary, since both cases are
actually load inversion cases, one pre-migration the other
post-migration, so we don't need the distinction.
It's actually wrong to increment stop_no_candidates, since there may
still be candidates, it's the load which is inverted.
Rather than maximum per-node shard overcommit. Global shard overcommit
is a better metric since we want to equalize global load not just
per-node load.
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.
This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.
This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixesscylladb/scylladb#19223
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.
`maybe_rehash` is complimentary and is not strictly required to succeed.
If it fails, it will retry on the next call, but there's no reason
to throw a bad_alloc exception that will fail its caller, since `maybe_rehash`
is called as the final step after the caller has already succeeded with
its action.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to call `service_level_controller::do_abort()` in all cases.
The current code (introduced in
535e5f4ae7)
calls do_abort if abort was not requested, however, since
it does so by checking the subscription bool operator,
it would miss the case where abort was already requested
before the subscription took place (in service_level_controller
ctor).
With scylladb/seastar@470b539b1c and
scylladb/seastar@8ecce18c51
we can just unconditionally call the subscription `on_abort`
method, that ensures only-once semantics, even if abort
was already requested at subscription time.
Fixesscylladb/scylladb#19075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19929
mapreduce_server was previously coroutinized, but only partially. This
series completes coroutinization and eliminates remaining continuation chains.
None of this code is performance sensitive as it runs at the super-coordinator level
and is amortized over a full scan of the entire table.
No backport needed as this is a cleanup.
Closesscylladb/scylladb#19913
* github.com:scylladb/scylladb:
mapreduce_service: reindent
mapreduce_service: coroutinize retrying_dispatcher::dispatch_to_node()
mapreduce_service: coroutinize dispatch() inner lambda
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.
We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".
This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.
Fixes#19911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19914
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.
But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!
So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).
The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19916
There's a test in object_store suite that verifies the contents of a bucket. It does with the plain http request, but unfortunately this doesn't work -- even local minio uses restricted bucket and using plain http request results in 403(Forbidden) error code. Test doesn't check it and continues working with empty list of objects which, in turn, is what it expects to see.
The fix is in using boto3. With it, the acc/secret pair is picked up and listing the bucket finally works.
Closesscylladb/scylladb#19889
* github.com:scylladb/scylladb:
test/object_store: Use boto3.resource to list bucket
test/object_store: Add get_s3_resource() helper
Instead of plain http request, use the power of boto3 package. The
recently added get_s3_resource() facilitates creating one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It creates boto3.resource object that points to endpoint maintained
by s3_server argument (that tests obtain via fixture). This allows
using boto3 to access S3 bucket from local minio server.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
instead of using fmt::runtime, use compile-time format string in
order to detect the bad format string, or missing format arguments,
or arguments which are not formattable at compile time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19901
* seastar 67065040...a7d81328 (30):
> reactor: Initialize _aio_pollfd later
> abortable_fifo: fix a typo in comment
> net: Expose DNS error category
> pollable_fd_state: use default-generated dtor
> perftune: tune tcp_mem
> scripts/perftune.py: clock source tweaking: special case Amazon and Google KVM virtualizations
> abort_source: subscription: keep callback function alive after abort
> github: disable ccache when building with C++ modules
> github: add enable-ccache input to test.yaml
> pollable_fd_state: Mark destructor protected and make non-virtual
> reactor: Mark .configure() private
> reactor: Set aio_nowait_supported once
> reactor: Add .no_poll_aio to reactor_config
> reactor: Move .max_poll_time on reactor_config
> reactor: Move .task_quota on reactor_config
> reactor: Move .strict_o_direct on reactor_config
> reactor: Move .bypass_fsync on reactor_config
> reactor: Move .max_task_backlog on reactor_config
> reactor: Move .force_io_getevents_syscall on reactor_config
> reactor: Move .have_aio_fsync on reactor_config
> reactor: Move .kernel_page_cache on reactor_config
> reactor: Move .handle_sigint on reactor_config
> reactor_backend: Construct _polling_io from reactor config
> reactor: Move config when constructing
> reactor: Use designated initializers to set up reactor_config
> native-stack: use queue::pop_eventually() in listener::accept()
> abort_source: subscription: allow calling on_abort explicitly
> file: document that close() returns the file object to uninitialized state
> code-cleanup: do not include 'smp.hh' in 'reactor.hh'
> code-cleanup: remove redundant includes of smp.hh
Closesscylladb/scylladb#19912
the parameter names do not match with the ones we are using.
these comments were inherited from Origin, but we failed to update
them accordingly.
in this change, the comments are updated to reflect the function
signatures.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19900
The build_unified.sh script accepts a --build-dir option, which
specifies the directory used for storing temporary files extracted
from tarballs defined by the --pkgs option. When performing parallel
builds of multiple modes, it's crucial that each build uses a unique
build directory. Reusing the same build directory for different modes
can lead to conflicts, resulting in build failures or, more seriously,
the creation of tarballs containing corrupted files.
so, in this change, we specify a different directory for each mode,
so that they don't share the same one.
Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19905
Simplify the function by converting it to a coroutine.
Note that while the final co_return co_await looks like a loop (and
therefore an await would introduce an O(n) allocation), it really isn't -
we retry at most once.
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
single tablet replica
In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.
Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes. This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.
Refs scylladb/scylladb#18862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19816
This pointer was only needed to pull all the way down the hints resource
manager start() method. It's no longer needed for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By the time hinst resource manager is started, proxy already has its
remote part initialized. Remote returns const gossiper pointer, but
after previous change hints code can live with it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places in hints code that need gossiper: hist_sender
calling gossiper::is_alive() and endpoint_downtime_not_bigger_than()
helper in manager. Both can live with const gossiper, so the dependency
references and anchor pointers can be restricted to const too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.
Fixes#19879
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#19883
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.
As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.
Fixes: scylladb/scylladb#19439
This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.
Closesscylladb/scylladb#19765
* github.com:scylladb/scylladb:
test: regression test for MV crash with tablets during decommission
db/view: drop view updates to replaced node marked as left
when deleting a base partition, in some cases we can update the view by
generating a single partition deletion update, instead of generating a
row deletion update for each of the partition rows.
If this is the case for all the affected views, and there are no other
updates besides deleting the partition, then we can skip reading and
iterating over all the rows, since this won't generate any additional
updates that are not covered already.
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.
When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.
With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.
Fixesscylladb/scylladb#8199
ScyllaMetrics is a useful generic component for retrieving metrics in a
pytest.
The commit moves the implementation from test_shedding.py to util.py to
make it reusable in other tests in cql-pytest.
There are few api::set_foo()-s left in main that are placed in ~~random~~ legacy order. This PR fixes it and makes few more associated cleanups.
refs: #2737Closesscylladb/scylladb#19682
* github.com:scylladb/scylladb:
api: Unset cache_service endpoints on stop
main: Don't ignore set_cache_service() future
api: Move storage API few steps above
api: Register token-metadata API next to token-metadata itsels
api: Do not return zero local host-id
api: Move snitch API registration next to snitch itself
They currently stay registered long after the dependent services get
stopped. There's a need for batch unsetting (scylladb/seastar#1620), so
currently only this explicit listing :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The call itself seem to be in wrong place -- there's no "cache service"
also the API uses database and snapshot_ctl to work on. So it deserves
more cleanup, but at least don't throw the returned future<> away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sequence currently is
sharded<storage_service>.start()
sharded<query_processor>.invoke_on_all(start_remote)
api::set_server_storage_service()
The last two steps can be safely swapped to keep storage service API
next to its service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now API registration happens quite late because it waits storage
service to register its "function" first. This can be done beforeheand
and the t.m. API can be moved to where it should be.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The local host id is read from local token metadata and returned to the
caller as string. The t.m. itself starts with default-constructed host
id vlaue which is updated later. However, even such "unset" host id
value can be rendered as string without errors. This makes the correct
work of the API endpoint depend on the initialization sequence which may
(spoilter: it will) change in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's finally no longer used. Now only sstables storage code "knows" that
keyspace may have its on-disk directory.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the init_keyspace_storage() expects that the caller would
tell it where the ks directory is, but it's not nice as keyspace may
not necessarity keep its sstables in any directory.
This patch moves the directory path evaluation into storage code,
specifically to the lambda that is called for on-disk sstables. The
way directory is evaluated mirrors the one from make_keyspace_config()
that will be removed by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The admission test has a section which tests admission when the
semaphore has inactive reads. This section (and therefore the enire
test) became flaky lately, after a seemingly unrelated seastar upgrade,
which improved timers.
The cause of the flakyness is the permit which is made inactive later:
this permit is created with 0 timeout (times out immediately). For some
time now, when the timeout timer of a permit fires, if the permit is
inactive, it is evicted. This is what makes the test fail: the inactive
read times out and ends up evicting this permit, which is not expected
for the test. The reason this was not a problem before, is that the test
finishes very quickly, usually, before the timer could even be polled by
the reactor. The recent seastar changes changed this and now the timer
sometimes get polled and fires, failing the test.
Fixes: #19801Closesscylladb/scylladb#19859
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.
The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.
Fixesscylladb/scylladb#19735Closesscylladb/scylladb#19740
Introduce virtual tasks - task manager tasks which cover
cluster-wide operations.
Virtual tasks aren't kept in memory, instead their statuses
are retrieved from associated service when user requests
them with task manager API. From API users' perspective,
virtual tasks behave similarly to regular tasks, but they can
be queried from any node in a cluster.
Virtual tasks cannot have a parent task. They can have
children on each node in a cluster, but do not keep references
to them. So, if a direct child of a virtual task is unregistered
from task manager, it will no longer be shown in parent's
children vector.
virtual_task class corresponds to all virtual tasks in one
group. If users want to list all tasks in a module, a virtual_task
returns all recent supported operations; if they request virtual
task's status - info about the one specified operation is
presented. Time to live, number of tracked operations etc.
depend on the implementation of individual virtual_task.
All virtual_tasks are kept only on shard 0.
Refs: https://github.com/scylladb/scylladb/issues/15852
New feature, no backport needed.
Closesscylladb/scylladb#16374
* github.com:scylladb/scylladb:
docs: describe virtual tasks
db: node_ops: filter topology request entries
test: add a topology suite for testing tasks
node_ops: service: create streaming tasks
node_ops: register node_ops_virtual_task in task manager
service: node_ops: keep node ops module in storage service
node_ops: implement node_ops_virtual_task methods
db: service: modify methods to get topology_requests data
db: service: add request type column to topology_requests
node_ops: add task manager module and node_ops_virtual_task
tasks: api: add virtual task support to get_task_status_recursively
tasks: api: add virtual task support
tasks: api: add virtual tasks support to get_tasks
tasks: add task_handler to hide task and virtual_task differences from user
tasks: modify invoke_on_task
tasks: implement task_manager::virtual_task::impl::get_children
tasks: keep virtual tasks in task manager
tasks: introduce task_manager::virtual_task
as these workflows are scheduled periodically, and if they fail, notifications are sent to the repo's owner. to minimize the surprises to the contributors using github, let's disable these workflows on fork repos.
Closesscylladb/scylladb#19736
* github.com:scylladb/scylladb:
github: do not run clang-tidy as a cron job
github: disable scheduled workflow on forks
We modify `ScyllaCluster.server_start` so that it changes seeds of the
starting node to all currently running nodes. This allows writing tests like
```python
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
However, it disallows writing tests that start multiple clusters. To fix this,
we add the `seeds` parameter to `server_start`.
We also improve the logic in `ScyllaCluster.add_server` to allow writing
tests like
```python
await manager.server_add(expected_error="...")
await manager.server_add()
```
This PR only adds improvements to the `test.py` framework, no need
to backport it.
Closesscylladb/scylladb#19847
* github.com:scylladb/scylladb:
test: scylla_cluster: improve expected_error in add_server
test: scylla_cluster: support more test scenarios
test: scylla_cluster: correctly change seeds in server_start
We make two changes:
- we lease the IP address of a node that failed to boot because of
an expected error,
- we don't log "Cluster ... added ..." when a node fails to boot
because of an expected error.
Here are some examples of tests that don't work with no initial
nodes, but they should work:
1.
```
await manager.server_add(expected_error="...")
await manager.server_add()
```
2.
```
await manager.servers_add(2, expected_error="...")
await manager.servers_add(2)
```
3.
```
s1 = await manager.server_add(start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_add()
```
4.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_start(s2.server_id, expected_error="...")
await manager.servers_add(2)
```
5.
```
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
6.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.servers_add(2)
await manager.server_start(s1.server_id)
await manager.server_start(s2.server_id)
```
In this patch, we make a few improvements to make tests like the ones
presented above work. I tested all the examples above manually.
From now on, servers receive correct seeds if the first servers added
in the test didn't start or failed to boot.
Also, we remove the assertion preventing the creation of a second
cluster. This assertion failed the tests presented above. We could
weaken it to make these tests pass, but it would require some work.
Moreover, we have tests that intentionally create two clusters.
Therefore, we go for the easiest solution and accept that a single
`ScyllaCluster` may not correspond to a single Scylla cluster.
We change seeds in `ScyllaCluster.server_start` to all currently
running nodes. The previous code only pretended that it did it.
After doing this change, writing tests that create multiple clusters
is impossible. To allow it, we add the `seeds` parameter to
`ManagerClient.server_start`. We use it to fix and simplify the only
test that creates two clusters - `test_different_group0_ids`.
system_keyspace::get_topology_request_entries returns entries for
requests which are running or have finished after specified time.
In task manager node ops task set the time so that they are shown
for task_ttl seconds after they have finished.
Add topology_tasks test suite for testing task manager's node ops
tasks. Add TaskManagerClient to topology_tasks for an easy usage
of task manager rest api.
Write a test for bootstrap, replace, rebuild, decommission and remove
top level tasks using the above.
Keep task manager node ops module in storage service. It will be
used to create and manage tasks related to topology changes.
The module is created and registered in storage service constructor.
In storage_service::stop() the module is stopped and so all the remaining
tasks would be unregistered immediately after they are finished.
Modify get_topology_request_state (and wait_for_topology_request_completion),
so that it doesn't call on_internal_error when request_id isn't
in the topology_requests table if require_entry == false.
Add other methods to get topology request entry.
topology_requests table will be used by task manager node ops tasks,
but it loses info about request type, which is required by tasks.
Add request_type column to topology_requests.
Virtual tasks are supported by get_task_status, abort_task and
wait_task.
Task status returned by get_task_status and wait_task:
- contains task_kind to indicate whether it's virtual (cluster) or
regular (node) task;
- children list apart from task_id contains node address of the task.
task_manager/list_module_tasks/{module} starts supporting virtual tasks,
which means that their stats will also be shown for users.
Additional task_kind param is added to indicate whether the task is
virutal (cluster-wide) or regular (node-wide).
Support in other paths will be added in following patches.
Contrary to regular tasks, which are per-operation, virtual tasks
are associated with the whole group of operations. There may be many
operations of each group performed at the same time. Info about each
running operation will be shown to a user through the API.
For virtual tasks, task manager imitates a regular task covering
each operation, but task_manager::tasks aren't actually created
in the memory. Instead, information (e.g. status) about the operation
is retrieved from associated service and passed to a user.
To hide most of the differences from user, task_handler class is created.
Task handler performs appropriate actions depending on task's kind.
However, users need to stay conscious about the kind of task, because:
- get_task_status and wait_task do not unregister virtual tasks;
- time for which a virtual tasks stays in task manager depends
on associated service and tasks' implementation;
- number of virtual task's children shown by get_tasks doesn't have
to be monotonous.
API is modified to use task_handler.
API-specific classes are moved to task_handler.{cc,hh}.
Virtual tasks are kept in task manager together with regular tasks.
All virtual tasks are stored on shard 0.
task_manager::module::make_task is modified to consider virtual
tasks as possible parents.
A virtual task is a new kind of task supported by task manager,
which covers cluster-wide operations.
From users' perspective virtual tasks behave similarly
to task_manager::tasks. The API side of virtual tasks will be
covered in the following patches.
Contrary to task_manager::task, virtual task does not update
its fields proactively. Moreover, no object is kept in memory
for each individual virtual task's operation. Instead a service
(or services) is queried on API user's demand to learn about
the status of running operation. Hence the name.
task_manager::virtual_task is responsible for a whole group
of virtual tasks, i.e. for tracking and generating statuses
of all operations of similar type.
To enable tracking of some kind of operations, one needs to
override task_manager::virtual_task::impl and provide implementations
of the methods returning appropriate information about the operations.
task_manager::virtual_task must be kept on shard 0.
Similarly to task_manager::tasks, virtual tasks can have child tasks,
responsible for tracking suboperations' progress. But virtual tasks
cannot have parents - they are always roots in task trees.
Some methods and structs will be implemented in later patches.
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.
The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.
The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().
But the hard part of this test is the testing:
1. An existing multi-node test for "/localnodes" assummed that right after
a new node was created, it appears on "/localnodes". But after this
patch, it may take a bit more time for the bootstrapping to complete
and the new node to appear in /localnodes - so I had to add a retry loop.
2. I added a test that reproduces the bug fixed here, and verifies its
fix. The test is in the multi-node topology framework. It adds an
injection which delays the bootstrap, which leaves a new node in JOINING
state for a long time. The test then verifies that the new node is
alive (as checked by the REST API), but is not returned by "/localnodes".
3. The new injection for delaying the bootstrap is unfortunately not
very pretty - I had to do it in three places because we have several
code paths of how bootstrap works without repair, with repair, without
Raft and with Raft - and I wanted to delay all of them.
Fixes#19694.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19725
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes https://github.com/scylladb/scylladb/issues/16287Closesscylladb/scylladb#16387
* github.com:scylladb/scylladb:
s3/client: add client::upload_file()
s3/client: move constants related to aws constraints out
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:
- with the file, we don't need to accumulate the payloads in memory
and send them in batch, as we do in upload_sink and in
upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
upload applies to it, this helps to improve the throughput.
so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.
Fixes#16287
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
minimum_part_size and aws_maximum_parts_in_piece are
AWS S3 related constraints, they can be reused out of
client::upload_sink and client::upload_jumbo_sink, so
in this change
* extract them out.
* use the user-defined literal with IEC prefix for
better readablity to define minimum_part_size
* add "aws_" prefix to `minimum_part_size` to be
more consistent.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.
Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.
Fixes: #19351.
Closesscylladb/scylladb#19834
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.
Not a regression, so a backport isn't strictly necessary.
Closesscylladb/scylladb#19792
* github.com:scylladb/scylladb:
config: specialize from-string conversion for bool
config: wrap boost::lexical_cast<> when converting from strings
If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.
Related to #15820
The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc.
Closesscylladb/scylladb#15971
* github.com:scylladb/scylladb:
commitlog: Make max data lifetime runtime-configurable
db::config: Expose commitlog_max_data_lifetime_in_s parameter
commitlog: Add optional max lifetime parameter to cl instance
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722Closesscylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
`filesystem_storage` methods frequently call `sync_directory()`, for the sake of flushing (sync'ing) a directory. `sync_directory()` always brackets the sync with open and close, and given that most `sync_directory()` calls target the sstable base directory, those repeated opens and closes are considered wasteful. Rework the `filesystem_storage::_dir` member (from a mere pathname) so that it stand for an `opened_directory` object, which keeps the sstable base directory open, for the purpose of repeated sync'ing.
Resolves#2399.
Closesscylladb/scylladb#19624
* github.com:scylladb/scylladb:
sstables/storage: synch "dst_dir" more leanly in create_links_common()
sstables/storage: close previous directory asynchronously upon dir change
sstables/storage: futurize change_dir_for_test()
sstables/storage: sync through "opened_directory" in filesystem...::move()
sstables/storage: sync through "opened_directory" in the "easy" cases
sstables/storage: introduce "opened_directory" class
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.
Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.
This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.
Fixesscylladb/scylladb#19772
Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.
Closesscylladb/scylladb#19787
We already have code to return min() for
the minimum and maximum tokens in long_token()
and raw(), so instead of using code to return
it, just make sure to set it in the _data member.
Note that although this change affect serialization,
the existing codebase ignores the deserialized bytes
and places a constant (0 before this patch, or min()
with it) in _data for non-key (minumum or maximum) tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Users outside of the token module don't
need to mess with the token::kind.
They can only create key tokens.
Never, minimum or maximum tokens, with a particular
datya value.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sizeof(dht::token) is only 16 bytes and therefore
it can be passed with 2 registers.
There is no sense in defining minimum_token
and maximum_token out of line, returning a token&
to statically allocated values that require memory
access/copy, while the only call sites that needs
to point to the static min/max tokens are in
dht::ring_position_view.
Instead, they can be defined inline as constexpr
functions and return their const values.
Respectively, define token ctors and methods
as constexpr where applicable (and noexcept while at it
where applicable)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to always initalize the _data member
to 0 for non-key (minimum or maximum) tokens.
This allows to simplify the equality operator
that now doesn't need to rely on `operator<=>`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The is_minimum/is_maximum predicates are more
efficient than comparing the the m{minimum,maximum}_token
values, respectrively. since the is_* functions
need to check only the token kind.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Token comparisons are abundant.
The equality operator is defined inline
in dht/token.hh by calling `t1 <=> t2`,
and so is `tri_compare_raw`, which `operator<=>`
calls in the common path, but `operator<=>` itself
is defined out of line, losing the benefits of inlining.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the rolling restart to avoid spurious driver reconnects.
This can be eventually reverted once the scylladb/python-driver#295 is fixed.
Fixesscylladb/scylladb#19154Closesscylladb/scylladb#19771
* github.com:scylladb/scylladb:
test: raft: fix the flaky `test_raft_recovery_stuck`
test: raft: code cleanup in `test_raft_recovery_stuck`
In v4 of scylladb/scylladb#19598 the last commit of the patch was replaced but this change missed merge so submitting it in a separate patch.
In the current patch, the original functions class correctly marks methods as const where appropriate, and the instance() method now returns a const object. This ensures protection against accidental modifications, as all changes must go through the change_batch object.
Since the functions_changer class was intended to serve the same purpose, it is now redundant. Therefore, we are reverting the commit that introduced it.
Relates scylladb/scylladb#19153Closesscylladb/scylladb#19647
* github.com:scylladb/scylladb:
cql3: functions: replace template with std::function in with_udf_iter()
cql3: functions: improve functions class constness handling
Revert "cql3: functions: make modification functions accessible only via batch class"
Replaced the old `read_barrier` helper from "test/pylib/util.py"
by the new helper from "test/pylib/rest_client.py" that is calling
the newly introduced direct REST API.
Replaced in all relevant tests and decommissioned the old helper.
Introduced a new helper `get_host_api_address` to retrieve the host API
address - which in come cases can be different from the host address
(e.g. if the RPC address is changed).
Fixes: scylladb/scylladb#19662Closesscylladb/scylladb#19739
filesystem_storage::create_links_common() runs on directories that
generally differ from "_dir", thus, we can't replace its sync_directory()
calls with _dir.sync(). We can still use a common (temporary)
"opened_directory" object for synching "dst_dir" three times, saving two
open and two close operations.
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In "filesystem_storage", change_dir_for_test() and move() replace "_dir"
with "opened_directory(new_dir)" using the move assignment operator.
Consequently, the file descriptor underlying "_dir" is closed
synchronously as a part of object destruction.
Expose the async file::close() function through "opened_directory".
Introduce filesystem_storage::change_dir() as a common async workhorse for
both change_dir_for_test() and move(). In change_dir(), close the old
directory asynchronously.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Currently change_dir_for_test() is synchronous. Make it return a future,
so that we can use async operations in change_dir_for_test() overrides.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Near the end of filesystem_storage::move(), we sync both the old
directory, and the new directory, if "delay_commit" is null. At that
point, the new directory is just "_dir"; call _dir.sync() instead of
sync_directory().
This patch is best viewed with "git show -W".
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Replace
sst.sstable_write_io_check(sync_directory, _dir.native())
with
_dir.sync(sst._write_error_handler)
Also replace the explicit (but still relatively "easy")
open_checked_directory() + flush() + flush() operations in
filesystem_storage::seal() with two _dir.sync() calls.
Because filesystem_storage::create_links_common() is marked "const", we
need to declare "_dir" mutable.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
"filesystem_storage::_dir" is currently of type "std::filesystem::path".
Introduce a new class called "opened_directory", and change the type of
"_dir" to the new class "opened_directory".
"opened_directory" keeps the directory open, and offers synchronization on
that open directory (i.e., without having to reopen the directory every
time). In subsequent patches, that will be put to use.
The opening and closing of the wrapped directory cannot easily be handled
explicitly in the "filesystem_storage" member functions.
(
Namely, test::store() and test::rewrite_toc_without_scylla_component()
-- both in "test/lib/sstable_utils.hh" -- perform "open -> ... -> seal"
sequences, and such a sequence may be executed repeatedly. For example,
sstable_directory_shared_sstables_reshard_correctly()
[test/boost/sstable_directory_test.cc] does just that; it "reopens" the
"filesystem_storage" object repeatedly.
)
Rather than trying to restrict the order of "filesystem_storage" member
function calls, replace the "opened_directory" object with a new one
whenever the directory pathname is re-set; namely in
filesystem_storage::change_dir_for_test() and filesystem_storage::move().
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
Refs scylladb/scylladb#17636Fixesscylladb/scylladb#17660Closesscylladb/scylladb#19741
* github.com:scylladb/scylladb:
db/hints: Verify that Scylla limits the concurrency of written hints
db/hints: Coroutinize `hint_endpoint_manager::store_hint()`
db/hints: Move a constant value to the TU it's used in
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes#19722
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new method, on_unlink() to the sstable_manager. This method is
now used by the sstable to notify the manager when it has been unlinked,
enabling the manager to update its bookkeeping as required. The
on_unlink method doesn't do anything yet but will be updated by the next
patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
in 3c7af287, cqlsh's reloc package was marked as "noarch", and its
filename was updated accordingly in `configure.py`, so let's update
the CMake building system accordingly.
this change should address the build failure of
```
08:48:14 [3325/4124] Generating ../Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 FAILED: Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 cd /jenkins/workspace/scylla-master/scylla-ci/scylla/build/dist && /usr/bin/cmake -E copy /jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 Error copying file "/jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz" to "/jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz".
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19710
In some cases, the S3 server will not know about a certain build and
any attempt to open a coredump which was generated by this build will
fail, because the S3 server returns an empty/illegal response.
There is already a bypass for missing package-url in the S3 server
response, but this doesn't help in the case when the response is also
missing other metadata, like build-id and version info.
Extend this existig mechanism with a new --scylla-package-url flag,
which provides complete bypass. When provided, the S3 server will not be
queried at all, instead the package is downloaded from the link and
version metadata is extracted from the package itself.
Closesscylladb/scylladb#19769
* seastar 908ccd93...67065040 (44):
> metrics: Use this_shard_id unconditionally
> sstring: prevent fmt from formatting sstring as a sequence
> coding style: allow lines up to 160 chars in length
> src/core: remove unnecessary includes
> when_all: stop using deprecated std::aligned_union_t
> reactor: respect preempt requests in debug mode
> core: fix -Wunused-but-set-variable
> gate: add try_hold
> sstring: declare nested type with typename
> rpc: pass start time to `wait_for_reply()` which accepts `no_wait_type`
> scripts/perftune.py: get rid of "SyntaxWarning: invalid escape sequence"
> scripts/perftune.py: add support for tweaking VLAN interfaces
> scripts/perftune.py: improve discovery of bond device slaves
> scripts/perftune.py: refactor __learn_slaves() function
> code-cleanup: add missing header guards
> code-cleanup: remove redundant includes of 'reactor.hh'
> code-cleanup: explicitly depend on io_desc.hh
> scripts/perftune.py: aRFS should be disabled by default in non-MQ mode
> code-cleanup: remove unneeded includes of fair_queue.hh
> docker: fix mount of install-dependencies
> code-cleanup: remove redundant includes of linux-aio.hh
> fstream: reformat the doxygen comment of make_file_input_stream()
> iostream: use new-style consumer to implement copy()
> stall-analyser: use 0 for default value of --minimum
> reactor: fix crash during metrics gathering
> build: run socket test with linux-aio reactor backend
> test: Add testing of connect()-ion abort ability
> linux_perf_event: exclude_idle only on x86_64
> linux_perf_event: add make_linux_perf_event
> stall-analyser: gracefully handle empty input
> shared_token_bucket: resolve FIXME
> io_tester: ensure that file object is valid when closing it
> tutorial.md: fix typo in Dan Kegel's name
> test,rpc: Extend simple ping-pong case
> rpc: Calculate delay and export it via metrics
> rpc: Exchange handler duration with server responses
> rpc: Track handler execution time
> rpc: Fix hard-coded constants when sending unknown verb reply
> reactor: Unfriend alien and smp queues
> reactor: Add and use stopped() getter
> reactor: Generalize wakeup() callers
> file: Use lighter access to map of fs-info-s
> file: Fix indentation after previous patch
> file: Don't return chain of ready futures from make_file_impl
Closesscylladb/scylladb#19780
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.
Closesscylladb/scylladb#19721
* github.com:scylladb/scylladb:
sstables: use _origin in write path
sstable::open_sstable: pass and store origin
This PR adds support for aborting index reads from within `index_consume_entry_context::consume_input` when the server is being stopped. The abort source is now propagated down to the `index_consume_entry_context`, making it available for `consume_input` to check if an abort has been requested. If an abort is detected, `consume_input` will throw an exception to stop the index read operation.
Closesscylladb/scylladb#19453
* github.com:scylladb/scylladb:
test/boost: test abort behaviour during index read
sstables/index_reader: stop consuming index when abort has been requested
sstables::index_consume_entry_context: store abort_source
sstable: drop old filter only after the new filter is built during rebuild
sstables/sstables_manager: store abort_source in sstable_manager
replica/database: pass abort_source to database constructor
The yaml/json representation for bool is true/false, but boost::lexical_cast
is 1/0. Specialize bool conversion to accept true/false (for yaml/json
compatibilty) and 1/0 (for backward compatibility). This provides
round-trip conversion for bool configs in system.config.
Configuration uses boost::lexical_cast to convert strings to native
values (e.g. bools/ints). However, boost::lexical_cast doesn't
recognize true/false for bool. Since we can't change boost::lexical_cast,
replace it with a wrapper that forwards directly to boost::lexical_cast.
In the next step, we'll specialize it for bool.
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.
Should be backported to all supported versions.
Closesscylladb/scylladb#19699
* github.com:scylladb/scylladb:
test: add test to check that coordinator lwt semaphore continues functioning after locking failures
paxos: do not signal semaphore if it was not acquired
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
since fmt 11, it is required that the format() to be const, otherwise
its caller in fmt library would not be able to call it. and compile
would fail like:
```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/abseil -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o -MF locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o.d -o locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o -c /home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc:9:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:16:
In file included from /home/kefu/dev/scylladb/gms/inet_address.hh:11:
In file included from /usr/include/fmt/ostream.h:23:
In file included from /usr/include/fmt/chrono.h:23:
In file included from /usr/include/fmt/format.h:41:
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
1393 | ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
| ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<locator::vnode_effective_replication_map::factory_key, fmt::formatter<locator::vnode_effective_replication_map::factory_key>>' requested here
1374 | custom.format = format_custom_arg<
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/util/log.hh:299:33: note: in instantiation of function template specialization 'fmt::format_to<seastar::internal::log_buf::inserter_iterator &, locator::vnode_effective_replication_map::factory_key &, const void *, 0>' requested here
299 | return fmt::format_to(it, fmt.format, std::forward<Args>(args)...);
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/util/log.hh:428:9: note: in instantiation of function template specialization 'seastar::logger::log<locator::vnode_effective_replication_map::factory_key &, const void *>' requested here
428 | log(log_level::debug, std::move(fmt), std::forward<Args>(args)...);
| ^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc:561:18: note: in instantiation of function template specialization 'seastar::logger::debug<locator::vnode_effective_replication_map::factory_key &, const void *>' requested here
561 | rslogger.debug("create_effective_replication_map: found {} [{}]", key, fmt::ptr(erm.get()));
| ^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:471:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<locator::vnode_effective_replication_map::factory_key>', but method is not marked const
471 | auto format(const locator::vnode_effective_replication_map::factory_key& key, FormatContext& ctx) {
| ^
1 error generated.
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19768
Now that the origin is available inside the sstable object, no need to
pass it to the methods called in the write path.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a new boost test, index_reader_test, with a testcase to verifyi
the abort behaviour during an index read using
index_consume_entry_context.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When an abort is requested, stop further reading of the index file and
throw and exception from index_consume_entry_context::process_state().
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Store abort source inside sstables::index_consume_entry_context, so that
the next patch can implement cancelling the index read when abort is
requested.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
sstable::maybe_rebuild_filter_from_index drops the existing filter first
and then rebuilds the new filter as the method is only called before the
sstable is sealed. But to make the index read abortable, the old filter
can be dropped only after the new filter is built so that in case if the
index consumer gets aborted, we still have the old filter to write to
disk.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a new member that stores the abort_source. This can later be used by
the sstables to check if an abort has been requested. Also implement
sstables_manager::get_abort_source() that returns a const reference to
the abort source.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This is in preparation for the following patch that adds abort_source
variable to the sstables_manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.
As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.
Fixes: scylladb/scylladb#19439
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.
Fixes: https://github.com/scylladb/scylladb/issues/19698
as these workflows are scheduled periodically, and if they fail,
notifications are sent to the repo's owner. to minimize the surprises
to the contributors using github, let's disable these workflows on
fork repos.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
there's not a 1:1 relationship between compaction group count and
tablet count. a tablet replica has a storage group instance, which
may map to multiple compaction groups during split mode.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Until now, the constant `HINT_FILE_WRITE_TIMEOUT` was
declared as a static member of `db::hints::manager`.
However, the constant is only ever used in one
translation unit, so it makes more sense to move it
there and not include boilerplate in a header.
If set, any remaining segment that has data older than this threshold
will request flushing, regardless of data pressure. I.e. even a system
where nothing happends will after X seconds flush data to free up the
commit log.
2024-07-09 12:30:48 +00:00
891 changed files with 29830 additions and 9375 deletions
seastar::metrics::description("number of rows read and dropped during filtering operations")),
seastar::metrics::make_counter("batch_item_count",seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
seastar::metrics::make_counter("batch_item_count",seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/backup",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a specified keyspace to a designated bucket in object storage",
"type":"string",
"nickname":"start_backup",
"produces":[
"application/json"
],
"parameters":[
{
"name":"endpoint",
"description":"ID of the configured object storage endpoint to copy sstables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"bucket",
"description":"Name of the bucket to backup sstables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy sstables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy sstables from",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/restore",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
"type":"string",
"nickname":"start_restore",
"produces":[
"application/json"
],
"parameters":[
{
"name":"endpoint",
"description":"ID of the configured object storage endpoint to copy SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"bucket",
"description":"Name of the bucket to read SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
help="Specify the artifacts to build, invoke {} with --list-artifacts to list all available artifacts, if unspecified all artifacts are built".format(sys.argv[0]))
arg_parser.add_argument('--with-seastar',action='store',dest='seastar_path',default='seastar',help='Path to Seastar sources')
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.