Commit Graph

45873 Commits

Author SHA1 Message Date
Ferenc Szili
dc375b8cd3 test: enable test_truncate_with_coordinator_crash
This test was added in PR #19789 but was disabled with xfail because of
a bug in the way truncate saved the commit log replay positions. More
specifically, the replay positions for shards that had no mutations were
saved to system.truncated with shard_id == 0, regardless of which shard
they were actually saved for (see #21719).
The bug was fixed in #21722, so this change removes the xfail tag from
the test.

Closes scylladb/scylladb#21902
2024-12-18 18:02:52 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Botond Dénes
1a717f3014 service/storage_proxy: data_resolver::resolve(): apply mutations gently
The data resolver has to apply all mutations from all replicas to a
single mutation. In the extreme case, when all rows are dead, the
mutations can have around 10K rows in them. This is not a huge amount,
but it is enough to cause moderate stalls of <20ms.
To avoid this, use the gentle variant of apply(), which can yield in the
middle.
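
The idea of a "gentle" apply can be sketched as follows. This is only an
illustration in Python with asyncio, not the actual C++ code: the names
apply_gently() and resolve(), the chunk size, and the (timestamp, value) row
model are all assumptions made for the example.

```python
import asyncio


async def apply_gently(target: dict, source: dict, chunk: int = 1000) -> None:
    """Merge `source` rows into `target`, yielding to the event loop every `chunk` rows."""
    for i, (key, row) in enumerate(source.items(), start=1):
        # Rows are modeled as (timestamp, value); the newer timestamp wins.
        # This is a stand-in for real mutation reconciliation.
        if key not in target or row[0] > target[key][0]:
            target[key] = row
        if i % chunk == 0:
            # Give other tasks a chance to run instead of stalling the loop.
            await asyncio.sleep(0)


async def resolve(replica_results: list) -> dict:
    """Fold the partial results from all replicas into a single merged result."""
    merged: dict = {}
    for partial in replica_results:
        await apply_gently(merged, partial)
    return merged
```

The merged result is the same as with a plain loop; only the scheduling
differs, since a large merge no longer runs as one uninterrupted step.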

Fixes: scylladb/scylladb#21818

Closes scylladb/scylladb#21884
2024-12-18 15:21:19 +01:00
Kefu Chai
e65fc35b5e replica: do not include unused headers
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21836
2024-12-18 13:52:57 +02:00
Avi Kivity
5a849b0a6a Merge "Move more subsystems to use host ids instead of ips" from Gleb
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows removing
many (but not all) of the functions that work on ips from the effective
replication map.

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/

Refs: scylladb/scylladb#21777
"

* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
  locator: topology: remove no longer use get_all_ips()
  gossiper: change get_unreachable_nodes to host ids
  locator: drop no longer used ip based functions from effective replication map and friends
  test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
  replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
  replica/mutation_dump: use host ids instead of ips
  alternator: move ttl to work with host ids instead of ips
  storage_service: move node_ops code to use host ids instead of host ips
  streaming: move streaming code to use host ids instead of host ips
  repair: move repair code to use host ids instead of host ips
  gossiper: add get_unreachable_host_ids() function
  locator: topology: add more function that return host ids to effective replication map
  locator: add more function that return host ids to effective replication map
2024-12-18 13:48:22 +02:00
Piotr Dulikowski
d067d8caef Merge 'More Python tests for materialized view and Alternator GSI feature' from Nadav Har'El
This patch includes more tests (in Python) that I wrote while implementing the Alternator UpdateTable feature for adding a GSI to an existing table (https://github.com/scylladb/scylladb/issues/11567).

I explain each of these tests in the separate patches below, but basically they fall into two types:
1. Tests which pass with today's materialized views and Alternator GSI/LSI, and serve to ensure that whatever changes I make to the view update implementation don't break corner cases that already worked.
2. Tests for the UpdateTable feature in Alternator, which doesn't work today, so they xfail - and will need to work for #11567. We already had a few tests for this, but here I add more and improve coverage of various corner cases I discovered while implementing the feature.

I already have a working prototype for #11567 which passes all these tests. Many of these tests helped expose various bugs in earlier versions of my code.

Closes scylladb/scylladb#21927

* github.com:scylladb/scylladb:
  test/cqlpy: a few more functional tests for materialized views
  test/alternator: more tests for UpdateTable create and delete GSI
  test/alternator: make UpdateTable tests wait less
  test/alternator: move UpdateTable tests to a separate file
  test/alternator: add another test for elaborate GSI updates
  test/alternator: test that DescribeTable returns IndexStatus for GSI
  test/alternator: fix wrong test for UpdateTable metrics
  test/alternator: add test for missing attribute in item in LSI
  test/alternator: test that DescribeTable doesn't return IndexStatus for LSI
  test/alternator: add tests for RBAC for create and delete GSI
2024-12-17 20:43:07 +01:00
Yaron Kaikov
3a00ffd2eb build_docker.sh: remove rsyslog installation and conf
It seems that no one is using rsyslog, so there is no point having it
inside our container (see https://github.com/scylladb/scylladb/issues/21923#issuecomment-2545191667)

Refs: https://github.com/scylladb/scylladb/issues/21923

Closes scylladb/scylladb#21953
2024-12-17 17:34:35 +02:00
Avi Kivity
01cdba9a98 Merge 'cache_algorithm_test: fix flaky failures' from Michał Chojnowski
This series attempts to get rid of flakiness in `cache_algorithm_test` by solving two problems.

Problem 1:

The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.

Problem 2:

Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in the background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

Fixes #21536

Should be backported to prevent flaky failures in older branches.

Closes scylladb/scylladb#21948

* github.com:scylladb/scylladb:
  cache_algorithm_test: harden against stats being confused by background activity
  cache_algorithm_test: fix a use of an uninitialized variable
2024-12-17 14:46:43 +02:00
Botond Dénes
73fc135e02 Merge 'test.py: make sure topology/ and topology_custom/ passes with tablets on.' from Konstantin Osipov
Explicitly disable tablets in a few tests that rely on features not yet supported with tablets.

Closes scylladb/scylladb#21070

* github.com:scylladb/scylladb:
  test: disable tablets in test_raft_fix_broken_snapshot
  test: disable tablets in test_raft_recovery_stuck
  test: disable tablets in tet_raft_recovery_majority_lost
  test: don't run test_raft_recovery_basic with tablets
  test: fix test_writes_to_previous_cdc_generations work with tablets
  test: fix topology_custom/test_mv_topology_change.py to work with tablets
  test: correct replication factor in test_multidc.py
  test: update test_view_build_status to work with tablets
  test: fix test_change_rpc_address with tablets.
  test: explicitly disable tablets in test_gropu0_schema_versioning
  test: disable tablets in topology/test_mutation_schema_change.py
  test: disable tablets in topology/test_mv.py
2024-12-17 08:38:10 +02:00
Aleksandra Martyniuk
d0cda8ebef replica: check enabled features in tablet_map_to_mutation
Before adding a value to a new column in tablet_map_to_mutation
check if the column is supported by the whole cluster.

Closes scylladb/scylladb#21941
2024-12-17 07:02:11 +02:00
Michał Chojnowski
6caaead4ac cache_algorithm_test: harden against stats being confused by background activity
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in the background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
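
The retry approach described above might look roughly like the sketch below.
It is written in Python for brevity and is not the test's real code: the
helper check_with_retry(), the "reads" and "misses" counters, and the attempt
limit are hypothetical.

```python
def check_with_retry(run_read, get_stats, expected_misses, max_attempts=5):
    """Run a read and compare the global cache-miss delta with the expectation.

    A mismatch is retried only when the global read counter shows that
    something other than our single read touched the cache in the meantime.
    """
    for _ in range(max_attempts):
        before = get_stats()
        run_read()
        after = get_stats()
        misses = after["misses"] - before["misses"]
        reads = after["reads"] - before["reads"]
        if misses == expected_misses:
            return True
        if reads <= 1:
            # Only the test's own read ran, so the mismatch is a real failure.
            return False
        # Extra reads were observed: background activity may explain the
        # mismatch, so try again.
    return False
```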
2024-12-16 23:14:30 +01:00
Michał Chojnowski
1fffd976a4 cache_algorithm_test: fix a use of an uninitialized variable
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
2024-12-16 23:14:13 +01:00
Nadav Har'El
99e7fdef6d test/cqlpy: a few more functional tests for materialized views
This patch adds a few more functional tests for the CQL materialized
view feature in cqlpy. The new tests pass, but helped me catch bugs (and
understand what are *not* bugs) while refactoring some view update code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
d9af154772 test/alternator: more tests for UpdateTable create and delete GSI
We already have in test_gsi_updatetable.py several functional tests for
the Alternator feature of adding or deleting a GSI on an existing table,
through the UpdateTable operation.
This patch adds many more tests for various corner cases of this feature -
tests developed in parallel with actually implementing that feature.

All tests in test_gsi_updatetable.py pass on Amazon DynamoDB but currently
xfail on Alternator, due to the following issues:

 * #11567: Alternator: allow adding a GSI to a pre-existing table
 * #9424: Alternator GSIs should exclude items with empty-string key components

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
5c7b8c8e4d test/alternator: make UpdateTable tests wait less
The UpdateTable tests for creating and deleting a GSI need to wait for
the asynchronous operation of the view's building and deletion, using
two utility functions wait_for_gsi() and wait_for_gsi_gone().

Because I originally wrote these tests for DynamoDB and its extremely
high latency for these operations, these functions waited a whole second
before checking for the end of the wait. This whole-second sleep is
absurd in Alternator where building a small view takes just a fraction of
a second. So let's lower the sleep time from 1 second to 0.1 seconds,
and allow these tests to pass much faster on Alternator (once this
feature is implemented in Alternator, of course - until then all these
tests still fail immediately on an unimplemented operation).
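
A polling helper with a short sleep interval could look like the sketch
below. It is not the code of wait_for_gsi() or wait_for_gsi_gone(); the
wait_until() name, the predicate argument and the timeout are assumptions
made for illustration.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.1):
    """Poll `predicate` every `interval` seconds until it returns True.

    Raises TimeoutError if the condition does not hold within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

With a 0.1 second interval, a view that is built in a fraction of a second is
noticed almost immediately, instead of after a full second.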

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
b1bd5cdf0f test/alternator: move UpdateTable tests to a separate file
The source file test/alternator/test_gsi.py has already grown very
large, so this patch moves all the existing tests related to using
UpdateTable to add or delete a GSI to a separate file:
test_gsi_updatetable.py.

We just move tests here - no new tests or functional changes to the
tests - but did use the opportunity for some small improvements in
the comments.

In the next patch we'll add more tests to this new file.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
cc308bd0cc test/alternator: add another test for elaborate GSI updates
We have a test, test/alternator/test_gsi.py::test_update_gsi_pk which
created a GSI whose *partition key* was a regular column in the base
table, and exercised various elaborate updates requiring adding,
updating and deleting of rows from the materialized view.

In this patch, we add another similar test case, just for a *clustering
key*.

Both these tests are important regression tests - when we later
reimplement GSI we'll want to verify that none of the complex update
scenarios got broken (and indeed, some broken code did break these
tests).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
9094fe1608 test/alternator: test that DescribeTable returns IndexStatus for GSI
This patch adds a test reproducing issue #11471 - where DescribeTable
on a table that has an already built GSI (created with the table itself)
must return IndexStatus == "ACTIVE".

This test passes on DynamoDB, but xfails on Alternator because of
issue #11471.

We actually had this check earlier, but it was part of a bigger xfailing
test that checked multiple features. It's better to have it as a
separate test just for this feature, as we'll soon fix this issue and
make this test pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
1b120e3c7e test/alternator: fix wrong test for UpdateTable metrics
The test we had for counting Alternator operations metrics ran the
UpdateTable request without any parameters, which isn't actually a
valid call - Amazon DynamoDB rejects such a call, saying one of the
different parameters must be present, and we'll want to do that
later too.

So let's fix the test to use a valid UpdateTable request, one that
does the silly BillingMode='PAY_PER_REQUEST'. This is already the
current setting, so nothing is really changed, but it's still counted
as an operation in the metric.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
85088516b2 test/alternator: add test for missing attribute in item in LSI
Test that when a table has an LSI, then if the indexed attribute is
missing, the item is added to the base table but not the index.

We already have exactly the same test for GSI in test_gsi.py, but forgot
to write the same test for LSI. It's important to test this scenario
separately for GSIs and LSIs because in an upcoming GSI reimplementation
we plan to make the GSI and LSI implementation slightly different, and
they can have separate bugs (and in fact, we had such an LSI-specific
bug in one broken implementation).

We also have the same scenario that is tested here in the test
test_streams.py::test_streams_updateitem_old_image_lsi_missing_column
but that is an Alternator Streams test and we should have a more basic
test for this scenario in test_lsi.py.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
b00f5a6070 test/alternator: test that DescribeTable doesn't return IndexStatus for LSI
Whereas GSIs have an IndexStatus when described by DescribeTable,
LSIs do not. The purpose of IndexStatus is to tell when the index is live,
and this is not needed for LSIs because they cannot be added to a base
table that already exists.

We already had a test for this, but it was hidden in an xfailing test
for many different DescribeTable attributes - so let's move it into its
own, *passing*, test. The new test passes on both Alternator and
Amazon DynamoDB.

This test is an important regression test for when we later add
IndexStatus support to GSI, and this test will ensure that we don't
accidentally introduce IndexStatus to LSIs as well - DynamoDB doesn't
generate it for LSIs so neither should Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
373b37b5da test/alternator: add tests for RBAC for create and delete GSI
In later patches we will implement (as requested in issue #11567) the
UpdateTable operation for creating a new GSI or removing a GSI on an
existing table. In this patch we add to test/alternator/test_cql_rbac.py
tests to exhaustively check that the new operations will behave as expected
with respect to role-based access control (RBAC):

1. UpdateTable requires the ALTER permissions on the affected table -
   as was already the case before (and was documented in compatibility.md).
   This should also be true for the newly-implemented UpdateTable
   operations that create a GSI and delete a GSI, and we test that.

   The above statement may sound counter-intuitive - why does creating
   or deleting a GSI require ALTER permissions (on the base table), not
   CREATE or DROP permissions? But this makes sense when you consider
   that CREATE permissions should allow you to create new independent tables,
   not to change the behavior or performance of existing tables (which
   adding a GSI does).

2. When a role has permissions to create a GSI, it should be able to
   read the new GSI (SELECT permissions). This is known as "auto-grant".

3. When a GSI is deleted, whatever permissions were set on it are revoked,
   so that if it's later recreated, the old permissions don't resurface.
   This is known as "auto-revoke".

Because the UpdateTable feature for creating and deleting a GSI is not
yet enabled, the new tests are all marked "xfail".

The new tests, like all tests in the file test/alternator/test_cql_rbac.py
are Scylla-only and are skipped on Amazon DynamoDB - because they test
the Scylla-only CQL-based role-based access control API.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:55:28 +02:00
Konstantin Osipov
686c0e517f test: disable tablets in test_raft_fix_broken_snapshot
The test is using force_gossip_topology_changes which
doesn't work with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
bba034202d test: disable tablets in test_raft_recovery_stuck
The test is using force_gossip_topology_mode which doesn't
work with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
3767a54696 test: disable tablets in tet_raft_recovery_majority_lost
The test is using force_gossip_topology_mode which doesn't
work with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
e961d692e6 test: don't run test_raft_recovery_basic with tablets
It uses force_gossip_topology_changes, which doesn't work
with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
d6fc0d5512 test: fix test_writes_to_previous_cdc_generations work with tablets
The test is testing CDC. CDC doesn't work with tablets.
Explicitly disable tablets in the keyspaces used by the test.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
169c2e62b8 test: fix topology_custom/test_mv_topology_change.py to work with tablets
test_mv_topology_change runs in gossip mode, so disable tablets as well.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
ff43f8d9f6 test: correct replication factor in test_multidc.py
In tablets mode, it is not allowed to CREATE a table
if the replication factor cannot be satisfied. E.g. if the keyspace
is defined to have replication_factor = 3 and there
are only 2 replicas, in vnodes mode one can still
CREATE the table and write to it, whereas in tablets
mode one gets an error.

The confusion is about what 'replication_factor' means.
When NetworkTopologyStrategy is used, in multi-dc mode, each DC must
have at least 'replication_factor' replicas and stores
'replication_factor' copies of data.

The test author (as well as the author of this "fix", see
my confused report of gh-21166) assumed that 'replication_factor'
means the total number of replicas, not the number of replicas
per DC.

Correct the test to use only one replica per DC, as this is the
topology the test is working with. The test is not specific
to the number of replicas, so the change does not impact
the logic of the test.
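
The per-DC meaning of 'replication_factor' can be illustrated with a small
Python sketch. The helper names and the dictionaries of per-DC values are
hypothetical and are not part of the test.

```python
def total_copies(rf_per_dc: dict) -> int:
    """Total number of data copies: the sum of the per-DC replication factors."""
    return sum(rf_per_dc.values())


def unsatisfied_dcs(rf_per_dc: dict, nodes_per_dc: dict) -> list:
    """Return the DCs that have fewer nodes than their replication factor."""
    return [dc for dc, rf in rf_per_dc.items() if nodes_per_dc.get(dc, 0) < rf]
```

For example, with replication_factor = 3 in each of two DCs the keyspace
asks for 6 copies in total, and a topology with one node per DC satisfies
neither DC; with one replica per DC it satisfies both.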
2024-12-16 08:38:05 -05:00
Konstantin Osipov
1e582b4c0f test: update test_view_build_status to work with tablets
The test runs a bunch of tests in gossip only mode, which doesn't
work with tablets, so disable tablets explicitly in these tests.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
3e55f1c033 test: fix test_change_rpc_address with tablets.
With tablets, it's not allowed to create a table in a keyspace
whose replication factor exceeds the actual number of nodes in the
cluster.

Pass the replication factor to random_tables fixture so that
a keyspace with a correct replication_factor is created.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
4b10c10c1b test: explicitly disable tablets in test_gropu0_schema_versioning
This is a gossip-based topology changes test, and tablets
don't work with gossip based topology.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
4aa7dca862 test: disable tablets in topology/test_mutation_schema_change.py
This test uses lightweight transactions, which are not enabled
with tablets keyspaces.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
2866b4f550 test: disable tablets in topology/test_mv.py
The test file contains two test cases, which both test
materialized view tombstone gc settings. With tablets the default
is "repair" which is different from vnodes.

The tests are testing that the gc settings are not inherited. With
tablets, the gc settings are forced. This is indistinguishable from
inheriting, so the tests are failing when run with tablets.
2024-12-16 08:38:05 -05:00
Botond Dénes
e6447f60c2 Merge 'db,auth,locator: Remove unused member variables' from Kefu Chai
this issue was identified by clang-20.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21835

* github.com:scylladb/scylladb:
  locator: remove unused member variable
  auth: remove unused member variable
  db: remove unused member variable
2024-12-16 15:16:17 +02:00
Kefu Chai
f2638c3d18 test: topology_custom: restrcuture comment as ordered list
When investigating issue #21724, the docstring for
`test_recover_stuck_raft_recovery` was found to be difficult to follow.
Restructured the docstring into an ordered list to:

1. Improve readability
2. Clearly outline the test steps
3. Make the test's logic and flow more immediately comprehensible

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21728
2024-12-16 14:30:13 +02:00
Pavel Emelyanov
7db9132b56 test: Add validation of getting/changing compaction strategy via REST API
The /column_family/compaction_strategy has GET and POST implemented, the
latter changes the strategy on the table.

An unknown strategy name implicitly results in an internal server error
code, by catching the exception from compaction_strategy::type(), which
tries to convert the strategy name string to the strategy enum class type.

This is to finish validation of #21533

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21569
2024-12-16 14:28:23 +02:00
Botond Dénes
34a8b492be Merge 'materialized view: make flow-control maximum delay configurable' from Piotr Dulikowski
This pull request is a continuation of scylladb/scylladb#20688 - the contents of the main commit are the same, the only change is the additional commit with a test.

Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays.

This hard-coded one-second maximum delay was considered *huge* - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one-second latency for writes wasn't having any noticeable effect on slowing down the client.

So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before.

Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it.
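
The arithmetic above can be made concrete with a small Python sketch. The
cubic curve is only an illustrative assumption (the text above says only that
smaller backlogs result in smaller delays), and view_update_delay_ms() and
max_throughput() are hypothetical names, not Scylla functions.

```python
def view_update_delay_ms(backlog_fraction: float, delay_limit_ms: float = 1000.0) -> float:
    """Delay added to a request, given the backlog as a fraction of its maximum.

    The delay grows with the backlog and reaches `delay_limit_ms` when the
    backlog is full. The cubic shape is an assumption made for this example.
    """
    x = min(max(backlog_fraction, 0.0), 1.0)
    return delay_limit_ms * x ** 3


def max_throughput(concurrency: int, latency_s: float) -> float:
    """Upper bound on requests per second for a client with fixed concurrency."""
    return concurrency / latency_s
```

With a full backlog and the default limit, the added delay is 1000 ms, so a
client with concurrency 1000 is capped at about 1000 requests per second.
With a limit of 0, the added delay is always 0.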

The new parameter's help string mentions both these use cases of the parameter.

Fixes #18187

This is new functionality, no need to backport to any open source release.

Closes scylladb/scylladb#21647

* github.com:scylladb/scylladb:
  materialized views: test for the MV delay configuration parameter
  service: add injection for skipping view update backlog
  materialized view: make flow-control maximum delay configurable
2024-12-16 14:20:33 +02:00
Yaron Kaikov
2e6755ecca .github/scripts/auto-backport.py: Add comment to PR when conflicts apply
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no way to tell whether the PR has conflicts (in Scylla this is important, since CI will not start on a draft PR).

Let's add a comment to notify the user we have conflicts

Closes scylladb/scylladb#21939
2024-12-16 14:17:40 +02:00
Raphael S. Carvalho
013e0d53ff replica: Fix use-after-free due to a race between split and cleanup
There is an assumption that every destroyed compaction_group will be stopped first.
Otherwise, the group is still referenced by the compaction manager, which can use it
after it is freed. That's what happened in issue #21867 in the context of merge.

The issue is pre-existing but was made more likely with merge.

One problem is a race between split and cleanup: if split is emitted while
cleanup is stopping groups, split preparation can add new groups that will
never be closed, since cleanup is already past the group stopping step.

Another problem found is that split completion handler is not accounting for possible
existence of merging groups, if split happens right after merge. Split completion
handler should stop all empty groups that previously had data split from them.

The problems will be fixed by guaranteeing that new groups will not be added for a
tablet being migrated away, and that empty groups are properly closed when handling
split completion.

A reproducer was added.

Fixes #21867.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#21920
2024-12-16 13:19:26 +02:00
Avi Kivity
fe9fcdfe30 task_manager.hh: replace boost ranges with std ranges
Standardize on one range library to reduce dependency load.

Unfortunately, std::views::concat (the replacement for boost::join),
is C++26 only. We use two separate inserts to the result vector to
compensate, and rationalize it by saying that boost::join() is likely
slow due to the need for type-erasure.

Closes scylladb/scylladb#21834
2024-12-16 13:08:02 +02:00
Artsiom Mishuta
e4dc86b552 fix(test.py): adjust break_manager method
Remove the unnecessary _mark_dirty call.

server_broken_event stops the whole file execution
(preventing the next tests from running, because the
Python server object is broken; PR: scylladb/scylladb#18236),
and the next file execution will create its own new cluster,
so _mark_dirty will not change anything.

Closes scylladb/scylladb#21429
2024-12-16 11:24:03 +01:00
Gleb Natapov
6890281486 locator: topology: remove no longer use get_all_ips()
2024-12-15 11:31:11 +02:00
Gleb Natapov
c2e3d875ab gossiper: change get_unreachable_nodes to host ids
2024-12-15 11:31:11 +02:00
Gleb Natapov
c39474cc7e locator: drop no longer used ip based functions from effective replication map and friends
2024-12-15 11:31:11 +02:00
Gleb Natapov
c5f1dc6293 test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
2024-12-15 11:31:11 +02:00
Gleb Natapov
ca55d1e658 replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
2024-12-15 11:31:11 +02:00
Gleb Natapov
77f8abb19a replica/mutation_dump: use host ids instead of ips
2024-12-15 11:31:11 +02:00
Gleb Natapov
38c13975ca alternator: move ttl to work with host ids instead of ips
2024-12-15 11:31:11 +02:00
Gleb Natapov
03c8ffa45c storage_service: move node_ops code to use host ids instead of host ips
2024-12-15 11:31:11 +02:00