Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.
(cherry picked from commit d1f960eee2)
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:
(a) The following things happened in order:
1. The view builder would be constructed.
2. Right after that, a deferred lambda would be created to stop the
view builder during shutdown.
3. group0_service would be started.
4. A deferred lambda stopping group0_service would be created right
after that.
5. The view builder would be started.
(b) Because the view builder depends on group0_client, it couldn't be
started before starting group0_service. On the other hand, other
services depend on the view builder, e.g. the stream manager. That
makes changing the order of initialization a difficult problem,
so we want to avoid doing that unless we're sure it's the right
choice.
(c) Since the view builder uses group0_client, there was a possibility
of running into a segmentation fault issue in the following
scenario:
1. A call to `view_builder::mark_view_build_success()` is issued.
2. We stop group0_service.
3. `view_builder::mark_view_build_success()` calls
`announce_with_raft()`, which leads to a use-after-free because
group0_service has already been destroyed.
This very scenario took place in scylladb/scylladb#20772.
Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are the
second attempt to the problem where we want to solve it in a safer manner.
The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.
Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.
Fixesscylladb/scylladb#20772
(cherry picked from commit 06ce976370)
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:
```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```
we started getting the following failure:
```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```
We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.
This reverts commit 7bad8378c7.
(cherry picked from commit a5715086a4)
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.
Fixesscylladb/scylladb#21807
Backport: not needed since this is a fix in the testing scripts.
Closesscylladb/scylladb#22192
(cherry picked from commit 2f1731c551)
Closesscylladb/scylladb#22249
This series attempts to get read of flakiness in cache_algorithm_test by solving two problems.
Problem 1:
The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.
Problem 2:
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
Fixesscylladb/scylladb#21536
(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)
Parent PR: scylladb/scylladb#21948Closesscylladb/scylladb#22228
* github.com:scylladb/scylladb:
cache_algorithm_test: harden against stats being confused by background activity
cache_algorithm_test: fix a use of an uninitialized variable
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.
Added a testcase to verify the fix.
Fixes#21887
This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.
Closesscylladb/scylladb#21895
* github.com:scylladb/scylladb:
sstables_manager: reclaim memory from sstables on unlink
sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
sstables: introduce disable_component_memory_reload()
sstables_manager: log sstable name when reclaiming components
(cherry picked from commit d4129ddaa6)
Closesscylladb/scylladb#21998
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
(cherry picked from commit 6caaead4ac)
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
(cherry picked from commit 1fffd976a4)
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.
This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.
A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.
Fixes: scylladb/scylladb#20082Closesscylladb/scylladb#21600
(cherry picked from commit 6c90a25014)
Closesscylladb/scylladb#21822
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.
Fixesscylladb/scylladb#20754
The PR should be backported to 6.2, this version has view builder on group0.
Closesscylladb/scylladb#21708
* github.com:scylladb/scylladb:
test/topology_custom/test_view_build_status: add reproducer
service/topology_coordinator: migrate view builder only if all nodes are up
(cherry picked from commit def51e252d)
Closesscylladb/scylladb#21850
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.
The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.
Fixesscylladb/scylladb#21763Closesscylladb/scylladb#21773
(cherry picked from commit 373855b493)
Closesscylladb/scylladb#21893
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.
Fixes#20030
This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.
Closesscylladb/scylladb#21582
* github.com:scylladb/scylladb:
compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
compaction_group: rename `update_main_sstable_list_on_compaction_completion`
compaction_group: update maintenance sstable set on scrub compaction completion
compaction_group: store table::sstable_list_builder::result in replacement_desc
table::sstable_list_builder: remove old sstables only from current list
table::sstable_list_builder: return removed sstables from build_new_list
(cherry picked from commit 58baeac0ad)
Closesscylladb/scylladb#21790
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.
Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency
Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions
Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios
Fixesscylladb/scylladb#21724
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21726
(cherry picked from commit 65949ce607)
Closesscylladb/scylladb#21734
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.
The scenario corresponding to the summary above looked like this:
1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
table has a row, so it generates a view update for it because
it doesn't notice that we already have data in the view.
We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.
Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.
Fixes https://github.com/scylladb/scylladb/issues/20889
Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.
Closesscylladb/scylladb#21632
* github.com:scylladb/scylladb:
test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime
(cherry picked from commit 733a4f94c7)
Closesscylladb/scylladb#21640
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.
Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.
Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.
Fixes: #21503.
Closesscylladb/scylladb#21504
(cherry picked from commit 572b005774)
Closesscylladb/scylladb#21622
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.
This PR increases coverage for existing code. Hence we need to backport it. Since only 6.2 version has zero
token node support, hence we only backport it to 6.2
Fixes: scylladb/scylladb#21096Closesscylladb/scylladb#21609
(cherry picked from commit acd643bd75)
Closesscylladb/scylladb#21764
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.
A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.
Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.
Fixes: #21612.
Needs backport to 6.1 and 6.2 as they contain the bug.
Closesscylladb/scylladb#21616
* github.com:scylladb/scylladb:
test: add test to check if repair is properly aborted
repair: add shard param to task_manager_module::is_aborted
repair: use task abort source to abort repair
repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
repair: fix task_manager_module::abort_all_repairs
(cherry picked from commit 5ccbd500e0)
Closesscylladb/scylladb#21642
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.
Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.
However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().
This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.
Fixes#21538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21540
(cherry picked from commit 7607f5e33e)
Closesscylladb/scylladb#21634
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.
Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.
Fixes: scylladb/scylladb#19959Closesscylladb/scylladb#21477
(cherry picked from commit 92db2eca0b)
Closesscylladb/scylladb#21556
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.
stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.
Fixes scylladb/scylladb#21159
* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body
(cherry picked from commit e942c074f2)
(cherry picked from commit d8500472b3)
(cherry picked from commit c08ba8af68)
(cherry picked from commit a7a55298ea)
(cherry picked from commit 6cce67bec8)
Refs #21299Closesscylladb/scylladb#21434
* github.com:scylladb/scylladb:
compaction_manager: stop: await _stop_future if engaged
compaction_manager: really_do_stop: assert that no tasks are left behind
compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
compaction/compaction_manager: stop_tasks(): unlink stopped tasks
compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.
Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
(cherry picked from commit e942c074f2)
In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependant on group0_service.
Because of this, group0_service should be initialized/destroyed before/after view_builder.
This patch also adds error injection to `raft_server_with_timeouts::read_barrier`, which does 1s sleep before doing the read barrier. There is a new test which reproduces the use after free bug using the error injection.
Fixesscylladb/scylladb#20772scylladb/scylladb#19745 is present in 6.2, so this fix should be backported to it.
Closesscylladb/scylladb#21471
* github.com:scylladb/scylladb:
test/boost/secondary_index_test: add test for use after free
api/raft: use `get_server_with_timeouts().read_barrier()` in coroutines
main,cql_test_env: start group0_service before view_builder
(cherry picked from commit 7021efd6b0)
Closesscylladb/scylladb#21506
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.
For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.
This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.
Fixesscylladb/scylladb#13757Closesscylladb/scylladb#21459
(cherry picked from commit 35921eb67e)
Closesscylladb/scylladb#21497
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes#20626.
Closesscylladb/scylladb#20737
* github.com:scylladb/scylladb:
tablet: Fix single-sstable split when attaching new unsplit sstables
replica: Fix tablet split execute after restart
(cherry picked from commit bca8258150)
Ref scylladb/scylladb#21415
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes https://github.com/scylladb/scylladb/issues/20044.
Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.
(cherry picked from commit bcd358595f)
(cherry picked from commit 93815e0649)
Refs https://github.com/scylladb/scylladb/pull/20939Closesscylladb/scylladb#21206
* github.com:scylladb/scylladb:
replica: Fix tombstone GC during tablet split preparation
service: Improve error handling for split
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes#20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.
Drop the error handling check, successful repair is a sufficient
passing condition.
Fixes: #21167.
(cherry picked from commit 85d9565158)
Closesscylladb/scylladb#21330
The skipped ranges should be multiplied by the number of tables
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
(cherry picked from commit 1392a6068d)
(cherry picked from commit 9868ccbac0)
Refs #21252Closesscylladb/scylladb#21313
* github.com:scylladb/scylladb:
test: Add test_node_ops_metrics.py
repair: Make the ranges more consistent in the log
repair: Fix finished ranges metrics for removenode
Despite OSS doesn't limit number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.
Fixesscylladb/scylladb#21044
(cherry picked from commit 846d94134f)
Closesscylladb/scylladb#21282
Fixes#21159
When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.
Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.
(cherry picked from commit c28a5173d9)
Closesscylladb/scylladb#21393
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id
” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
Robust topology tests are added, which spins up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.
The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions dont support zero token nodes.
Fixes: scylladb/scylladb#19849
(cherry picked from commit c00d40b239)
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.
Fixes: https://github.com/scylladb/scylladb/issues/21102
Should be backported to every 6.x branch, as it may lead to a crash.
(cherry picked from commit de511f56ac)
(cherry picked from commit 3f4c8a30e3)
(cherry picked from commit 522bede8ec)
Refs https://github.com/scylladb/scylladb/pull/21121Closesscylladb/scylladb#21256
* github.com:scylladb/scylladb:
test: topology: add disable_schema_agreement_wait utility function
test: add UT to test retrying ALTER tablets KEYSPACE
cql/tablets: fix indentation in `rf_change` event handler
cql/tablets: fix retrying ALTER tablets KEYSPACE
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.
After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.
Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
Fixes: https://github.com/scylladb/scylladb/issues/21263
This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2
(cherry picked from commit e1d8cddd09)
Refs scylladb/scylladb#21058Closesscylladb/scylladb#21178
* github.com:scylladb/scylladb:
test/boost/mutation_test: add test for multishard permit safety
test/lib/reader_lifecycle_policy: add semaphore factory to constructor
test/lib/reader_lifecycle_policy: rename factory_function
repair/row_level: drop now unneeded release_base_resource() calls
readers/multishard: make multishard reader safe to create with admitted permits
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after peforming the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.
However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.
In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.
Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.
Fixes: scylladb/scylladb#21060
(cherry picked from commit a380a2efd9)
Closesscylladb/scylladb#21129
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.
(cherry picked from commit e1d8cddd09)
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
(cherry picked from commit 5a3fd69374)
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.
(cherry picked from commit c8598e21e8)
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closesscylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
(cherry picked from commit 519e167611)
Backported from #20985 to 6.2.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#21251
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
(cherry picked from commit 522bede8ec)
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
deselect remove_data_dir_of_dead_node event from test_random_failures
due to ussue #20751
(cherry picked from commit 9b0e15678e)
Closesscylladb/scylladb#21138
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.
Fixes: #21034
Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.
(cherry picked from commit f8475915fb)
Closesscylladb/scylladb#21107
This is a manual backport of #20788
When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:
1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1
If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.
This change fixes this problem by disabling tombstone garbage collection for pending replicas.
This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.
Fixes#21090Closesscylladb/scylladb#21061
* github.com:scylladb/scylladb:
test: test tombstone GC disabled on pending replica
tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
database::table: add tombstone_gc_enabled(locator::tablet_id)
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.
Test for scylladb/scylladb#20625
(cherry picked from commit 132358dc92)
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.
Fixes: #20240
(cherry picked from commit e0c1a51642)
Closesscylladb/scylladb#21022
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.
Fixes: #20039
Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.
(cherry picked from commit 042825247f)
(cherry picked from commit adf453af3f)
(cherry picked from commit 9c5950533f)
(cherry picked from commit 47acdc1f98)
(cherry picked from commit 93d61d7031)
(cherry picked from commit 6676e47371)
(cherry picked from commit 2aabe7f09c)
(cherry picked from commit ee56bbfe61)
Refs #20208Closesscylladb/scylladb#21009
* github.com:scylladb/scylladb:
cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
cql: join new and old KS options in ALTER tablets KS
cql: fix validation of ALTERing RFs in tablets KS
cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
cql: validate RF change for new DCs in ALTER tablets KS
cql: extend test_alter_tablet_keyspace_rf
cql: refactor test_tablets::test_alter_tablet_keyspace
cql: remove unused helper function from test_tablets