Raft topology goes over all nodes in a 'left' state and triggers a
'remove node' notification if the id/ip mapping is available (meaning
the node left recently). The problem is that, since the mapping is not
removed immediately, when multiple nodes are removed in succession a
notification for the same node can be sent several times. Fix that by
sending the notification only if the node still exists in the peers
table: it is removed by the first notification, so the following
notifications will not be sent.
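For illustration, a minimal model of the suppressed-notification logic
(all names below are hypothetical stand-ins, not the real Scylla types):
```cpp
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

// Minimal model: 'peers' stands in for the peers table and 'ip_of' for
// the id/ip mapping that lingers after a node leaves.
struct topology_model {
    std::unordered_set<int> peers;               // nodes still in the peers table
    std::unordered_map<int, const char*> ip_of;  // lingering id/ip mapping

    void notify_left(const std::unordered_set<int>& left) {
        for (int id : left) {
            if (ip_of.find(id) == ip_of.end()) {
                continue;  // node left long ago, mapping already dropped
            }
            // The fix: notify only while the node is still in the peers
            // table; the first notification erases it, suppressing repeats.
            if (peers.erase(id) == 1) {
                std::printf("remove node notification for %d\n", id);
            }
        }
    }
};
```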
Closes scylladb/scylladb#27743
(cherry picked from commit 4a5292e815)
Closes scylladb/scylladb#27911
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled, TRUNCATE is a global
topology operation which needs to serialize with bootstrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediately if the TRUNCATE is already being
processed.
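A minimal sketch of this join-or-fail rule, assuming hypothetical names
(the real bookkeeping lives in Scylla's truncate path):
```cpp
#include <optional>
#include <stdexcept>

enum class truncate_state { queued, running };

// Hypothetical model of the join-or-fail rule described above.
struct truncate_registry {
    std::optional<truncate_state> pending; // state of an earlier TRUNCATE, if any

    void submit() {
        if (!pending) {
            pending = truncate_state::queued;      // nothing queued: enqueue
        } else if (*pending == truncate_state::queued) {
            // join the queued request (e.g. a client retry after a timeout)
        } else {
            // already running: a late retry fails immediately
            throw std::runtime_error("TRUNCATE already being processed");
        }
    }
};
```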
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can end up
validating the result of a truncate which was started after bootstrap
was completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received while the TRUNCATE is being processed,
which fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
Fixes: #26347
Closes scylladb/scylladb#27245
(cherry picked from commit d883ff2317)
Closes scylladb/scylladb#27505
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
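In outline, the simplified ping reduces to something like the sketch
below; `seastar::sleep_abortable()` and `seastar::sleep_aborted` are the
real Seastar APIs, while the surrounding function shape is illustrative:
```cpp
#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>

// sleep_abortable() owns all of its waiting state internally, so there
// is no promise left to outlive a lambda during coroutine unwinding.
seastar::future<> ping_once(std::chrono::milliseconds interval, seastar::abort_source& as) {
    try {
        co_await seastar::sleep_abortable(interval, as);
    } catch (const seastar::sleep_aborted&) {
        // aborted: just return, nothing dangles
    }
}
```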
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closes scylladb/scylladb#27737
(cherry picked from commit 2a75b1374e)
Closes scylladb/scylladb#27780
When the direct failure detector was introduced, the idea was that it
would run on the same connection as the raft group0 verbs, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch moves it to
the gossiper connection as well and fixes the pinger code to run in the
gossiper scheduling group.
(cherry picked from commit 86dde50c0d)
Currently direct_fd_ping runs without a timeout, but the verb is not
waited on forever: the wait is canceled after a timeout, and this
timeout is simply not passed to the rpc. This can create a situation
where the rpc callback runs on the destination while no one is waiting
for it any longer.
Change the code to pass the timeout to the rpc as well, and return early
from the rpc handler if the timeout has already been reached by the time
the callback is called. This is backwards compatible since the timeout
is passed as an optional.
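An illustrative model of the handler-side deadline check (names and
types are stand-ins, not the real rpc code):
```cpp
#include <chrono>
#include <optional>

using clock_type = std::chrono::steady_clock;

// The timeout now travels with the request (as an optional, for
// backwards compatibility), and the handler bails out if the deadline
// already passed by the time it runs.
bool handle_ping(std::optional<clock_type::time_point> deadline) {
    if (deadline && clock_type::now() >= *deadline) {
        return false; // the caller stopped waiting; skip the work
    }
    // ... perform the actual ping ...
    return true;
}
```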
(cherry picked from commit 82f80478b8)
More efficient than 100 pings.
There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.
(cherry picked from commit f83c4ffc68)
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and the
system will appear unresponsive to mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstable and compaction work, which caused frequent stalls that
delayed polling by ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn the IP mapping of the bootstrapping node, and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.
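A sketch of the batched replication, assuming hypothetical
`address_update`/`apply_update` stand-ins for the real mapping entries:
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/smp.hh>
#include <vector>

struct address_update { int host; int ip; };

void apply_update(const address_update&) { /* update this shard's copy */ }

// One cross-shard call per shard replicates the whole backlog, instead
// of one call per update.
seastar::future<> replicate_pending(std::vector<address_update> pending) {
    for (unsigned shard = 0; shard < seastar::smp::count; ++shard) {
        if (shard == seastar::this_shard_id()) {
            continue; // the owning shard already has the data
        }
        co_await seastar::smp::submit_to(shard, [batch = pending] {
            for (const auto& u : batch) {
                apply_update(u);
            }
        });
    }
}
```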
Fixes #26865
Fixes #26835
(cherry picked from commit 4a85ea8eb2)
Key goals:
- efficient (batching updates)
- reliable (no lost updates)
Will be used in data structures maintained on one designated owning
shard and replicated to other shards.
(cherry picked from commit ed8d127457)
Fixes #24346
When reading, we check for each entry and each chunk whether advancing
will hit EOF of the segment. However, iff the last chunk being read has
its last entry _exactly_ matching the chunk size, and the chunk ends
at _exactly_ the segment size (preset size, typically 32MB), we did not
check the position, and instead complained about not being able to read.
This has literally _never_ happened in an actual commitlog (that was
replayed, at least), but has apparently happened more and more in hints
replay.
The fix is simple: just check the file position against the size when
advancing said position, i.e. when reading (skipping already does).
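A simplified model of the boundary condition, with illustrative names:
```cpp
#include <cstddef>

enum class read_status { ok, eof, corrupt };

// After consuming an entry, the read position may land exactly on the
// segment size. That is a clean EOF, not a failed read.
read_status advance(std::size_t& pos, std::size_t entry_size, std::size_t segment_size) {
    pos += entry_size;
    if (pos == segment_size) {
        return read_status::eof;     // exact end of segment: clean EOF
    }
    if (pos > segment_size) {
        return read_status::corrupt; // overrunning the segment is a real error
    }
    return read_status::ok;
}
```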
v2:
* Added unit test
Closes scylladb/scylladb#27236
(cherry picked from commit 59c87025d1)
Closes scylladb/scylladb#27340
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key read resumes, but doesn't find the sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes
It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
The solution consists of failing reads whose underlying tablet
replica has been cleaned up, by just converting the internal error
into a plain exception.
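The shape of the fix, in a hedged sketch (names are stand-ins):
```cpp
#include <stdexcept>

struct sstable_set {}; // stand-in for the real type

// When a resumed read finds its compaction group gone (tablet replica
// cleaned up), throw a plain exception instead of tripping
// abort-on-internal-error.
const sstable_set& sstables_or_throw(const sstable_set* s) {
    if (!s) {
        // previously: on_internal_error(...), crashing the node
        throw std::runtime_error("tablet replica was cleaned up; failing abandoned read");
    }
    return *s;
}
```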
Fixes #26229.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#27078
(cherry picked from commit 74ecedfb5c)
Closes scylladb/scylladb#27152
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes #27055
Closes scylladb/scylladb#27075
(cherry picked from commit e35ba974ce)
Closes scylladb/scylladb#27108
This series allows an operator to reset the 'cleanup needed' flag if they
have already cleaned up the node, so that automatic cleanup will not do it
again. We also change 'nodetool cleanup' back to running cleanup on one node
only (and resetting the 'cleanup needed' flag at the end), but the new
'--global' option allows running cleanup on all nodes that need it
simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported versions, since the current automatic cleanup
behaviour may create load during cluster resizing that the operator does
not expect.
- (cherry picked from commit e872f9cb4e)
- (cherry picked from commit 0f0ab11311)
Parent PR: #26868
Closes scylladb/scylladb#27091
* github.com:scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
- (cherry picked from commit c0f7622d12)
- (cherry picked from commit 222eab45f8)
- (cherry picked from commit 394207fd69)
- (cherry picked from commit b357c8278f)
Parent PR: #26856
Closes scylladb/scylladb#27034
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset the "cleanup needed" flag afterwards), and adds a "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
(cherry picked from commit 0f0ab11311)
Load-and-stream is broken when running concurrently with the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
Two possible fixes (maybe both):
1) load-and-stream waits for the topology to quiesce
2) perform split compaction on the sstable that spans both sibling tablets
This patch implements fix 1). By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
- (cherry picked from commit 3abc66da5a)
- (cherry picked from commit 4654cdc6fd)
Parent PR: #26456
Closes scylladb/scylladb#26647
* github.com:scylladb/scylladb:
sstables_loader: Don't bypass synchronization with busy topology
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
Any empty object of the json::json_list type has its internal
_set variable assigned to false, which results in such objects
being skipped by the json::json_builder.
Hence, the json returned by the API GET /compaction_manager/compaction_history
does not contain the field `rows_merged` if a cell in the
system.compaction_history table is null or an empty list.
In such cases, executing the command `nodetool compactionhistory`
will crash with the following error message:
`error running operation: rjson::error (JSON assert failed on condition 'false'`
The patch fixes it by checking whether the json object contains the
`rows_merged` element before processing it. If the element does
not exist, nodetool will now produce an empty list.
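A sketch of the nodetool-side check, using a std::map as a stand-in for
the parsed JSON entry (the real code uses rjson):
```cpp
#include <map>
#include <string>
#include <vector>

using json_entry = std::map<std::string, std::vector<long>>;

// Tolerate a missing `rows_merged` field instead of asserting, falling
// back to an empty list.
std::vector<long> rows_merged_of(const json_entry& entry) {
    if (auto it = entry.find("rows_merged"); it != entry.end()) {
        return it->second;
    }
    return {}; // field absent (null/empty cell): print an empty list
}
```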
Fixes https://github.com/scylladb/scylladb/issues/23540
Closes scylladb/scylladb#23514
(cherry picked from commit 113647550f)
Closes scylladb/scylladb#26947
This patch fixes 2 issues in one go:
First, currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json,
as they are currently printed as binary bytes; they
should be printed simply as numbers, as we do elsewhere,
for example for the first and last keys.
Fixes #26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#26991
(cherry picked from commit f9ce98384a)
Closes scylladb/scylladb#27033
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:
```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```
To avoid that, we should wait until initialization is complete.
(cherry picked from commit b357c8278f)
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
(cherry picked from commit 394207fd69)
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and was therefore disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled() was
used, but the topology kind is only available/set on shard 0, which
caused the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.
The reason the reproducer didn't catch this is that it was restricted
to a single CPU. It will now run with multiple CPUs and catch the
observed problem.
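An illustrative shape of a shard-safe check, assuming a hypothetical
raft_topology_enabled_on_shard0() accessor for the shard-0-only state:
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/smp.hh>

// Hypothetical: stands in for the topology state kept only on shard 0.
bool raft_topology_enabled_on_shard0() { return true; }

// Other shards must ask shard 0 instead of reading a local,
// never-initialized value.
seastar::future<bool> raft_topology_change_enabled_any_shard() {
    co_return co_await seastar::smp::submit_to(0, [] {
        return raft_topology_enabled_on_shard0();
    });
}
```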
Fixes #22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#26730
(cherry picked from commit 7f34366b9d)
(cherry picked from commit 4c466ace4f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This patch series re-enables support for speculative retry values `0` and `100`. These values were supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, the valid `100PERCENTILE` and `0PERCENTILE` values were prevented too.
The reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) no longer reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.
Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.
Fixes #26369
This is a bug fix. If at any time a client tries to use a value >= 99.5 and < 100, a raft error will happen. Backport is needed. The code which introduced the inconsistency was introduced in 2025.2, so no backporting to 2025.1.
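For reference, an inclusive-bounds validation matching the fixed
behavior might look like this sketch:
```cpp
#include <stdexcept>
#include <string>

// 0 and 100 are accepted; anything outside [0, 100] is rejected.
double validate_percentile(double v) {
    if (v < 0.0 || v > 100.0) {
        throw std::invalid_argument("PERCENTILE must be between 0 and 100, got "
                                    + std::to_string(v));
    }
    return v;
}
```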
- (cherry picked from commit da2ac90bb6)
- (cherry picked from commit 5d1913a502)
- (cherry picked from commit aba4c006ba)
- (cherry picked from commit 85f059c148)
- (cherry picked from commit 7ec9e23ee3)
Parent PR: #26909
Closes scylladb/scylladb#27013
* github.com:scylladb/scylladb:
test: cqlpy: add test case for non-numeric PERCENTILE value
schema: speculative_retry: update exception type for sstring ops
docs: cql: ddl.rst: update speculative-retry-options
test: cqlpy: add test for valid speculative_retry values
schema: speculative_retry: allow 0 and 100 PERCENTILE values
Add a test case for a non-numeric PERCENTILE value, which raises an error
different from the one for out-of-range invalid values. The regex in the
test test_invalid_percentile_speculative_retry_values is expanded.
Refs #26369
(cherry picked from commit 7ec9e23ee3)
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.
Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.
Refs #26369
(cherry picked from commit 5d1913a502)
This is a backport of the fix for https://github.com/scylladb/scylladb/issues/26040 and of the related test (https://github.com/scylladb/scylladb/pull/26589) to 2025.2.
Before this change, unauthorized connections stayed in the main
scheduling group. That is not ideal; in such a case sl:default
should instead be used, for consistent behavior with the scenario
where a user is authenticated but has no service level assigned.
This commit adds a call to update_scheduling_group at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to sl:default.
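A hedged outline of the change, with minimal stubs standing in for
Scylla's transport types (none of these names are the real API):
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

struct scheduling_group_ref {};
scheduling_group_ref default_service_level_group() { return {}; } // "sl:default"

struct connection {
    bool authenticated = false;
    seastar::future<> update_scheduling_group(scheduling_group_ref) { co_return; }
};

// At the end of connection creation, an unauthenticated connection is
// moved to sl:default instead of staying in the main group.
seastar::future<> finish_connection_setup(connection& conn) {
    if (!conn.authenticated) {
        co_await conn.update_scheduling_group(default_service_level_group());
    }
}
```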
Fixes: https://github.com/scylladb/scylladb/issues/26040
Fixes: https://github.com/scylladb/scylladb/issues/26581
(cherry picked from commit 278019c328)
(cherry picked from commit 8642629e8e)
No backport, as it's already a backport (but similar PRs will be created for 2025.3, and 2025.4)
Closes scylladb/scylladb#26813
* github.com:scylladb/scylladb:
test: add test_anonymous_user to test_raft_service_levels
transport: call update_scheduling_group for non-auth connections
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.
Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.
Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.
This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.
Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581
Closes scylladb/scylladb#26589
(cherry picked from commit 8642629)
Sometimes file::list_directory() returns entries without the type set. In
that case the lister calls file_type() on the entry name to get it. If
the call returns a disengaged type, the code assumes that some error
occurred and resolves into an exception.
That's not correct. The file_type() method returns a disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened"; the entry should just be
skipped. If some error had happened, file_type() would resolve into an
exceptional future on its own.
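A sketch of the corrected lister step; seastar::file_type() is the real
Seastar API, while the surrounding function is illustrative:
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/sstring.hh>

seastar::future<bool> classify_entry(seastar::sstring name) {
    auto type = co_await seastar::file_type(name);
    if (!type) {
        // disengaged means ENOENT: the file vanished between readdir
        // and stat, so skip the entry instead of failing
        co_return false;
    }
    // ... handle *type ...
    co_return true;
}
```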
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26595
(cherry picked from commit d9bfbeda9a)
Closes scylladb/scylladb#26759
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.
Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.
Fixes #19959
Closes scylladb/scylladb#26638
(cherry picked from commit 5321720853)
Closes scylladb/scylladb#26755
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes #25195.
- (cherry picked from commit 1106157756)
- (cherry picked from commit ea41f652c4)
- (cherry picked from commit a7e46974d4)
- (cherry picked from commit e1d9c83406)
- (cherry picked from commit 8d5bd212ca)
- (cherry picked from commit 6ba0fa20ee)
- (cherry picked from commit 8410532fa0)
Parent PR: #26003
Closes scylladb/scylladb#26300
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8410532fa0)
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 6ba0fa20ee)
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
adding a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.
(cherry picked from commit da8748e2b1)
It turns out that Boost assertions are thread-unsafe
(they can't be used from multiple threads concurrently).
This sometimes causes the test to fail with cryptic log corruption.
Fix that by switching to thread-safe checks.
Fixes scylladb/scylladb#24982
Closes scylladb/scylladb#26472
(cherry picked from commit 7c6e84e2ec)
Closes scylladb/scylladb#26550
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.
Fix that by making the test wait for CQL availability via `get_ready_cql()`.
Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()`.
Fixes scylladb/scylladb#25362
Closes scylladb/scylladb#25366
(cherry picked from commit 85fd4d23fa)
Closes scylladb/scylladb#26513
Add a query_state parameter to several auth functions that execute
internal queries. Currently the queries use the
internal_distributed_query_state() query state, and we keep this as the
default, but we also want to be able to pass a query state from the
caller.
In particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executing in some
different context.
(cherry picked from commit 3c3dd4cf9d)
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them
To fix it, the check in step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there so that ongoing
compaction will not propagate data being truncated, but here it happens
in the context of drop table, which doesn't leave anything behind. Also,
a stopped group is somewhat equivalent to one with compaction disabled,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from the compaction manager.
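A minimal model of the adjusted invariant check, with stand-in types:
```cpp
#include <cassert>
#include <vector>

struct group {
    bool stopped;
    bool compaction_disabled;
};

// Stopped groups are exempt from the check, since stopping a group
// already terminates its ongoing compactions.
void check_compaction_disabled(const std::vector<group>& groups) {
    for (const auto& g : groups) {
        if (g.stopped) {
            continue; // a stopped group cannot run compaction anyway
        }
        assert(g.compaction_disabled);
    }
}
```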
Fixes #25551.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#25563
(cherry picked from commit 149f9d8448)
Closes scylladb/scylladb#25631
Currently, while stopping the compaction_manager, we stop the task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger on_fatal_internal_error.
Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.
Check module abort source before creating compaction executor and
adding it to _tasks.
Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
compaction to finish.
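In outline, the new stop order could be sketched as follows
(stop_ongoing_compactions() is a placeholder for draining
compaction_manager::_tasks):
```cpp
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>
#include <utility>

struct module_stopper {
    seastar::abort_source _as;
    seastar::gate _gate;

    seastar::future<> stop_ongoing_compactions() { co_return; } // placeholder

    seastar::future<> stop() {
        _as.request_abort();                  // 1) abort the module abort source
        auto closed = _gate.close();          // 2) start closing the gate in the background
        co_await stop_ongoing_compactions();  // 3) stop tasks kept in _tasks
        co_await std::move(closed);           // 4) wait until the gate is closed
    }
};
```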
Fixes: https://github.com/scylladb/scylladb/issues/25806.
Fixes a shutdown bug; needs backports to all versions.
- (cherry picked from commit 17707d0e6b)
- (cherry picked from commit 97c77d7cd5)
Parent PR: #25885
Closes scylladb/scylladb#26223
* github.com:scylladb/scylladb:
compaction: move _tasks check
compaction: stop compaction module in really_do_stop
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After the last step, the split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map from when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where a split initiated by the sstable writer failed, the split completed, and the unsplit sstable was left in the table dir, causing problems on restart.
To fix this, let's make split compaction always work with the state from when it started, not the global state.
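A hedged sketch of the ownership change, with stand-in types (the point
is pinning the map, not the real API):
```cpp
#include <memory>

struct tablet_map {}; // stand-in for the real tablet map

struct split_compaction {
    // Pin the tablet map that was current when the compaction started,
    // instead of re-reading the mutable global map mid-compaction.
    std::shared_ptr<const tablet_map> _snapshot;

    explicit split_compaction(std::shared_ptr<const tablet_map> current)
        : _snapshot(std::move(current)) {}

    // All boundary decisions use the pinned snapshot, so a concurrent
    // merge of sibling tablets cannot change them under our feet.
    const tablet_map& boundaries() const { return *_snapshot; }
};
```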
Fixes #24153.
All 2025.* versions are vulnerable, so fix must be backported to them.
- (cherry picked from commit 0c1587473c)
- (cherry picked from commit 68f23d54d8)
Parent PR: #25690
Closes scylladb/scylladb#25934
* github.com:scylladb/scylladb:
replica: Fix split compaction when tablet boundaries change
replica: Futurize split_compaction_options()
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After the last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map from when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where
a split initiated by the sstable writer failed, the split completed,
and the unsplit sstable was left in the table dir, causing
problems on restart.
To fix this, let's make split compaction always work with
the state from when it started, not the global state.
Fixes #24153.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 68f23d54d8)
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer, using asyncio.wait.
The done group is never awaited, so we never learn about the error.
The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that, and the test finishes successfully
after a 10-minute timeout.
Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
  - a new table is added so that concurrent migration is possible;
- use wait_for_first_completed, which awaits the done group;
- do some cleanups.
Remove nightly mark.
Fixes: #26148.
Closes scylladb/scylladb#26209
(cherry picked from commit 48bbe09c8b)
Closes scylladb/scylladb#26219
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently with really_do_stop. We can't predict the order of the two.
Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with a memory leak.
Stop compaction module in really_do_stop, after ongoing compactions
are stopped.
It's a preparation for further patches.
(cherry picked from commit 17707d0e6b)
Consider the following:
The tablet load balancer is working on:
- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity then node1
- node3: is being decommissioned and contains tablet replicas
In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:
// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};
load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.
Let's assume dst is a shard on node1.
Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:
// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;
If pick_candidate() selects some other empty destination node (due to
larger capacity: node1), and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will search for the node using
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_range exception, crashing the load balancer.
This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes that pass populate()'s argument
filters.
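A minimal model of the corrected populate()/pick() interplay, with
simplified types:
```cpp
#include <unordered_map>
#include <vector>

using host_id = int; // stand-in for the real host id type

// shard -> tablet count on that shard
using shard_load = std::unordered_map<unsigned, unsigned>;

struct load_sketch_model {
    std::unordered_map<host_id, shard_load> _nodes;

    // The fix: seed the internal map with every node that passes the
    // populate() filters, including empty ones, so pick() never calls
    // at() on an unseen node.
    void populate(const std::vector<host_id>& filtered_nodes, unsigned shards_per_node) {
        for (host_id h : filtered_nodes) {
            auto& shards = _nodes[h]; // inserts empty nodes too
            for (unsigned s = 0; s < shards_per_node; ++s) {
                shards.emplace(s, 0);
            }
        }
    }

    unsigned pick(host_id h, unsigned shard) const {
        return _nodes.at(h).at(shard); // safe: every candidate was seeded
    }
};
```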
Fixes: #26203
Closes scylladb/scylladb#26207
(cherry picked from commit c6c9c316a7)
Closes scylladb/scylladb#26239
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.
To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.
A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.
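A sketch of the replay-loop change, with a stand-in for
db::no_such_column_family:
```cpp
#include <functional>
#include <stdexcept>

struct no_such_column_family : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A batch referencing a dropped table is deleted instead of being
// retried forever, mirroring the existing missing-keyspace path.
void replay_batch(const std::function<void()>& replay,
                  const std::function<void()>& drop_batch) {
    try {
        replay();
    } catch (const no_such_column_family&) {
        drop_batch(); // the table is gone: retrying can never succeed
    }
}
```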
Fixes scylladb/scylladb#24806
Closes scylladb/scylladb#26057
(cherry picked from commit 35f7d2aec6)
Closes scylladb/scylladb#26200
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).
All three tests perform two rolling restarts, but the latter ones
already have the workaround.
Fixes #26005
Closes scylladb/scylladb#26056
(cherry picked from commit a56115f77b)
Closes scylladb/scylladb#26197