Commit Graph

27 Commits

Author SHA1 Message Date
Botond Dénes
bde7d8ddbd Merge 'service: pass current session_id to repair rpc' from Aleksandra Martyniuk
Currently, in repair_tablet we retrieve session_id from tablet
map (and throw if it isn't specified). In case of topology
coordinator failover, we may end up in a situation where a node
runs outdated repair, treating session of a different operation
as the repair's session:

- topology coordinator starts repair transition (A);
- topology coordinator sends tablet repair rpc to node1;
- topology coordinator is separated from the cluster;
- new topology coordinator is elected;
- new topology coordinator sees waiting repair request (A_2)
  and executes it;
- new repair of the same tablet is requested (B);
- new topology coordinator starts repair transition (B);
- new topology coordinator sends tablet repair rpc to node2;
- node2 starts repair (B) as repair master;
- node1 starts repair (A), checks the current session (B), proceeds
  with repair (B) as repair master.

Send current session_id in repair_tablet rpc. If this session_id
and session id got from tablet map don't match, an exception
is thrown.

Fixes: https://github.com/scylladb/scylladb/issues/23318.

No backport; changes in rpc signatures

Closes scylladb/scylladb#25369

* github.com:scylladb/scylladb:
  test: check that repair with outdated session_id fails
  service: pass current session_id to repair rpc
2025-09-17 17:28:35 +03:00
Tomasz Grabiec
a7f10b585e Merge 'drop table: fix crash on drop table with concurrent cleanup' from Ferenc Szili
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash

Fixes: #25706

This needs to be backported to all supported versions with tablets

Closes scylladb/scylladb#25708

* github.com:scylladb/scylladb:
  test: reproducer and test for drop with concurrent cleanup
  truncate: check for closed storage group's gate in discard_sstables
2025-09-02 00:02:14 +02:00
Aleksandra Martyniuk
33a547e740 test: check that repair with outdated session_id fails 2025-08-29 17:00:48 +02:00
Ferenc Szili
1b8a44af75 test: reproducer and test for drop with concurrent cleanup
This change adds a reproducer and test for issue #25706
2025-08-28 16:51:36 +02:00
Michał Jadwiszczak
cf138da853 test: adjust existing tests
- Disable tablets in `test_migration_on_existing_raft_topology`.
Because views on tablets are experimental now, we can safely
assume that view building coordinator will start with view build status
on raft.
- Add error injection to pause view building on worker.
Used to pause view building process, there is analogous error injection
in view_builder.
- Do a read barrier in `test_view_in_system_tables`
Increases test stability by making sure that the node sees up-to-date
group0 state and `system.built_views` is synced.
- Wait for view is build in some tests
Increases tests stability by making sure that the view is built.
- Remove xfail marker from `test_tablet_streaming_with_unbuilt_view`
This series fix https://github.com/scylladb/scylladb/issues/21564
and this test should work now.
2025-08-27 10:23:04 +02:00
Artsiom Mishuta
4b975668f6 tiering (test.py): introduce tiering labels
introduce tiering marks
1 “unstable” - For unstable tests that will be will continue runing every night and generate up-to-date statistics with failures without failing the “Main” verification path(scylla-ci, Next)

2 “nightly” - for tests that are quite old, stable, and test functionality that rather not be changed or affected by other features, are partially covered in other tests, verify non-critical functionality, have not found any issues or regressions, too long to run on every PR,  and can be popped out from the CI run.

set 7 long tests(according to statistic in elastic) as nightly(theses 8 tests took 20% of CI run,
about 4 hours without paralelization)
1 test as unstable(as exaple ot marker usage)

Closes scylladb/scylladb#24974
2025-08-04 15:38:16 +03:00
Taras Veretilnyk
1d6808aec4 topology_coordinator: Make tablet_load_stats_refresh_interval configurable
This commits introduces an config option 'tablet_load_stats_refresh_interval_in_seconds'
that allows overriding the default value without using error injection.

Fixes scylladb/scylladb#24641

Closes scylladb/scylladb#24746
2025-07-31 14:31:55 +03:00
Avi Kivity
f7324a44a2 compaction: demote normal compaction start/end log messages to debug level
Compaction is routine and the log messages pollute the log files,
hiding important information.

All the data is available via `nodetool compactionhistory`.

Reduce noise by demoting those log messages to debug level.

One test is adjusted to use debug level for compaction, since it
listens for those messages.

Closes scylladb/scylladb#24949
2025-07-29 08:02:22 +03:00
Aleksandra Martyniuk
a0031ad05e api: repair_async: forbid repairing tablet keyspaces
Return 403 Forbidden if a user tries to repair tablet keyspace with
/storage_service/repair_async/ API.
2025-07-24 11:11:09 +02:00
Aleksandra Martyniuk
83c9af9670 test: add test for repair and resize finalization
Add test that checks whether repair does not start if there is an
ongoing resize finalization.
2025-06-11 16:17:39 +02:00
Evgeniy Naydanov
f6e3fdd778 test.py: rework log_browsing for dtest migration
Rework `ScyllaLogFile.wait_for()` method to make it easier
to add required methods to ScyllaNode class of ccm-like shim.

Also, added `ScyllaLogFile.grep_for_errors()` method and
reworked `ScyllaLogFile.grep()`
2025-05-19 11:50:55 +00:00
Dawid Mędrek
c4b32c38a3 test/cluster: Disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
2025-05-10 16:30:49 +02:00
Dawid Mędrek
c8c28dae92 test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.
2025-05-10 16:30:46 +02:00
Dawid Mędrek
04567c28a3 test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.
2025-05-10 16:30:43 +02:00
Dawid Mędrek
dbb8835fdf test/cluster: Adjust simple tests to RF-rack-validity
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:

* Using `pytest.mark.prepare_3_racks_cluster` instead of
  `pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
  `ManagerClient::servers_add()`.

In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.
2025-05-10 16:30:18 +02:00
Aleksandra Martyniuk
76cd707b18 test: test_tablets: wait for cql
Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
to prevent failing queries.

Fixes: #23620.

Closes scylladb/scylladb#23796
2025-04-24 21:25:29 +03:00
Tomasz Grabiec
001d3b2415 Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.

Unit test:

Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.

Fixes https://github.com/scylladb/scylladb/issues/20073.

Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.

Closes scylladb/scylladb#23751

* github.com:scylladb/scylladb:
  storage_service: add unit test for mid-decommission transit_tablet()
  storage_service: preserve state of busy topology when transiting tablet
2025-04-16 00:19:24 +02:00
Laszlo Ersek
841ca652a0 storage_service: add unit test for mid-decommission transit_tablet()
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2025-04-15 15:15:25 +02:00
Dawid Mędrek
a59842257a test: Move test_alter_tablet_keyspace_rf to cluster suite
We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the
cluster test suite. The reason behind the change is that the test cannot
be run with `rf_rack_valid_keyspaces` turned on in the configuration.
During the test, we make the keyspace RF-rack-invalid multiple times.
Since RF-rack-validity is a very strong constraint, adjust the test
otherwise is impossible.

By moving it to the cluster test suite, we're able to change the
configuration of the node used in the test, and so the test can work
again.
2025-04-11 14:55:11 +02:00
Dawid Mędrek
0ed21d9cc1 test/cluster/test_tablets.py: Fix test errorneous indentation
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.

Closes scylladb/scylladb#23659
2025-04-10 11:06:01 +03:00
Aleksandra Martyniuk
372b562f5e test: add test for rebuild with repair 2025-04-08 10:42:02 +02:00
Botond Dénes
fcdae20fd1 Merge 'Add tablet enforcing option' from Benny Halevy
This series add a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either cases, tablets can be opted-in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`

Refs scylladb/scylla-enterprise#4355

* Requires backport to 2025.1

Closes scylladb/scylladb#22273

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-03 16:32:19 +03:00
Aleksandra Martyniuk
bae6711809 \test: add test to check concurrent migration and repair of two different tablets 2025-04-02 15:30:17 +02:00
Lakshmi Narayanan Sreethar
5b47d84399 topology_coordinator: do not schedule migrations when there are pending resize finalizations
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.

Also added a testcase to verify the fix.

Fixes #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Benny Halevy
c62865df90 db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 14:54:45 +02:00
Aleksandra Martyniuk
20f9d7b6eb test: add test to check concurrent tablets migration and repair
Add a test to check whether a tablet can be migrated while another
tablet is repaired.
2025-03-17 10:37:03 +01:00
Artsiom Mishuta
d1198f8318 test.py: rename topology_custom folder to cluster
rename topology_custom folder to cluster
as it contains not only topology test cases
2025-03-04 10:32:44 +01:00