Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.
Extend the documentation to explain what colocated tables are and
document these restrictions.
Fixesscylladb/scylladb#27261Closesscylladb/scylladb#27516
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.
Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.
Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state
This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.
The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.
Fixes: scylladb/scylladb#25530
No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.
Closesscylladb/scylladb#26931
* https://github.com/scylladb/scylladb:
topology: make removenode use left_token_ring state for global barrier
topology: allow removing nodes not having tokens
features: add feature flag for removenode via left token ring
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.
This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.
We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.
Fixesscylladb/scylladb#27610Closesscylladb/scylladb#27704
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.
However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
writted yet.
This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.
Fixesscylladb/scylladb#26683Closesscylladb/scylladb#27389
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
acquires a read lock on the vector of listeners, and calls the
callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
write lock on the vector of listeners. it waits because the lock is
held.
A: the callback function calls another notification and calls
thread_for_each which tries to acquire the read lock again. but it
waits since there is a waiter.
Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.
Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.
Fixesscylladb/scylladb#27364Closesscylladb/scylladb#27637
Make the removenode operation go through the `left_token_ring` state,
similar to decommission. This ensures that when removenode completes,
all nodes in the cluster are aware of the topology change through a
global token metadata barrier.
Previously, removenode would skip the `left_token_ring` state and go
directly from `write_both_read_new` to `left` state. This meant that
when the operation completed, some nodes might not yet know about the
topology change, potentially causing issues with subsequent data plane
requests.
Key changes:
- Both decommission and removenode now transition to `left_token_ring`
state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the
shutdown RPC (removed nodes may already be dead)
- Updated documentation to reflect that both operations use this state
This change improves consistency guarantees for removenode operations
by ensuring cluster-wide awareness before completion.
Fixes: scylladb/scylladb#25530
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).
This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).
Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.
To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.
Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.
Fixes#27718
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27687
This action is triggered when a new milestone is created in scylladb.git
It will call the main logic which will create the same milestone as Jira
releases in the SCYLLADB and CUSTOMER Jira projects.
Fixes: PM-100
Closesscylladb/scylladb#27717
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.
After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].
In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.
Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.
While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.
Fixes: #26398.
New feature, no backport
Closesscylladb/scylladb#27279
* github.com:scylladb/scylladb:
test: add est_rack_list_conversion_with_two_replicas_in_rack
test: test creating tablet_rack_list_colocation_plan
test: add test_numeric_rf_to_rack_list_conversion test
tasks: service: add global_topology_request_virtual_task
cql3: statements: allow altering from numeric rf to rack list
service: topology_coordinator: pause keyspace_rf_change request
service: implement make_rack_list_colocation_plan
service: add tablet_rack_list_colocation_plan
cql3: reject concurrent alter of the same keyspace
test: check paused rf change requests persistence
db: service: add paused_rf_change_requests to system.topology
service: pass topology and system_keyspace to load_balancer ctor
service: tablet_allocator: extract load updates
service: tablet_allocator: extract ensure_node
tasks, system_keyspace: Introduce get_topology_request_entry_opt()
node_ops: Drop get_pending_ids()
node_ops: Drop redundant get_status_helper()
runner.py defines a command-line option `--extra-scylla-cmdline-options`
with the default type=str. However, the function `merge_cmdline_options`,
which consumes this value to merge command-line options from multiple
sources, expects a list of strings.
This mismatch results in the following exception:
```
raise ValueError(f'invalid argument name {name}, all args {args}')
ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error
```
when a test is run with pytest using:
`--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'`
Fix this by handling the option consistently and calling `.split()`.
Also change the default value from an empty list to an empty string
to avoid confusion both in runner.py and test.py.
Closesscylladb/scylladb#27523
When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair to never running.
Let's fix the name of the second of the each pair to be the real name it had in the original Cassandra test.
Closesscylladb/scylladb#27644
* github.com:scylladb/scylladb:
test/cqlpy: rename test with duplicate name
test/cqlpy: rename test with duplicate name
Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27645
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).
Fixes: #27248Closesscylladb/scylladb#27650
Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took **a lot** of human time).
These patches don't change anything in the functionality of any of the tests. It's all cosmetic.
Closesscylladb/scylladb#27655
* github.com:scylladb/scylladb:
test/alternator: remove unnecessary duplicate statement
test/alternator: remove unused variable assignments
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
Fixes#25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed—updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible.
fixes: #26060fixes: #27295
backport: not required
this is a minor bug fix. Internal logic worked but the user couldnt see the change in the table if they would read the system.clients table
Closesscylladb/scylladb#26404
* github.com:scylladb/scylladb:
test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.
server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.
server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.
Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.
The progress of the rf change isn't counted.
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
To do the conversion from numeric rf to rack list, we need to co-locate
tablets of the keyspace, so that all of them have replicas on the same racks.
Pause the keyspace_rf_change global topology request, so that the co-location
could be done before the ALTER KEYSPACE changes are applied.
The pause is needed if in any dc rf changes from numeric to rack list
and the co-location is necessary. In this case we don't finish the request.
Instead, we add the request to the paused request vector. No migrations are
started.
The make_rack_list_colocation_plan consists of two phases.
In the first phase (realized with find_required_rack_list_colocations),
we find the pairs of (replica to be co-located, destination dc and rack).
We skip the pairs related to the tablets that are in transition or for
which the load balancer migration is already planned. We group the pairs
by destination dc and rack. Thanks to that in the second phase we can
calculate the least loaded nodes and shards only once for each rack.
In the second phase, we calculate the load of the nodes in a cluster
based on current transition and previously scheduled migrations.
We utilize the map created in the first phase and choose the least
loaded targets in each rack. We skip the tablets for which the
co-location was already scheduled.
find_required_rack_list_colocations isn't a method of load_balancer,
because in the following changes it is going to be reused by topology
coordinator to determine whether the rf change should be paused.
Add tablet_rack_list_colocation_plan. Keep it in migration_plan.
The plan includes a request that is ready to resume. There can be
more than one such request at the time, but we consider them one
by one for clarity of code. Rack list co-locations will be kept together
with normal load balancer migrations.
Consider normal load balancer migrations before rack list co-locations.
During rack list co-location, allow load balancer migrations to happen
only within a single rack. Do not create the merge co-location plan if
there is ongoing rack list co-location (if there are any rf changes paused).
Generate rf change resume based on the plan. Add _request_to_resume back
to global requests queue.
make_rack_list_colocation_plan will be implemented in the following change.
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.
This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.
We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.
In this commit no data is kept in the new column.
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
Every pending request should also have an entry in
system.topology_requests so it's redundant.
And problematic, because we cannot build a full request entry from
just an id alone, so if we would return those requests, they would
have blank information, and logic which needs more information would
not work.
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.
This patch series contains:
- Introduction of `CLIENT_ROUTES` feature flag.
- Implementation of raft-based `system.client_routes` table
- Implementation of `v2/client-routes` POST/DELETE/GET endpoints
- Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
- New tests that verifies the aforementioned features
Ref: scylladb/scylla-enterprise#5699
For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.
Closesscylladb/scylladb#27323
* https://github.com/scylladb/scylladb:
docs: describe CLIENT_ROUTES_CHANGE extension
test: add test for CLIENT_ROUTES event
service: transport: add CLIENT_ROUTES_CHANGE event
test: add cluster tests for client routes
test: add API tests for client_routes endpoints
test: add `timeout` parameter to `delete` in RESTClient
test: allow json_body in send
api: implement client_routes endpoints
api: add client_routes.json
service: main: add client_routes_service
db: add system.client_routes table
gms: add CLIENT_ROUTES feature
This reverts commit 866c96f536, reversing
changes made to 367633270a.
This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.
Fixes: #27496Closesscylladb/scylladb#27518
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.
Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27642
Fixes#26744
If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.
However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.
The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.
Closesscylladb/scylladb#27556
* github.com:scylladb/scylladb:
test::cluster::dtest::tools::files: Remove file
commitlog_replay: Handle fully corrupt files same as partial corruption.
test::pylib::suite::base: Split options.name test specifier only once
Fixes#17384
Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.
v2:
Bypass off-strat for whatever reason iff dest is tablets.
Closesscylladb/scylladb#27500
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).
We can use gdb's `gcore` command to produce a coredump in a more
predictable way.
Refs scylladb/scylladb#22501Closesscylladb/scylladb#27498
This series adds an xfailing reproducers for two issue: #8070 and #27037:
27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.
8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.
The first two patches are two small cleanups (including fixes#27372) that I did while preparing the real tests - which are in the final two patches.
Closesscylladb/scylladb#27376
* github.com:scylladb/scylladb:
test/alternator: add reproducer for bug with storing invalid values
test/alternator: reproducer for issue 27375
utils/rjson: fix error messages from rjson::parse()
test/alternator: extract get_signed_request() to util.py