Create raft/raft_fwd.hh with lightweight type aliases (server_id, group_id,
term_t, index_t) backed only by raft/internal.hh, avoiding the heavy
raft/raft.hh (832 lines with futures, abort_source, bytes_ostream).
Replace raft/raft.hh with raft/raft_fwd.hh in headers that only need the
basic ID types: tablets.hh, topology_state_machine.hh,
topology_coordinator.hh, storage_service.hh, group0_fwd.hh,
view_building_coordinator.hh, view_building_worker.hh.
Also remove gossiper.hh and tablet_allocator.hh from storage_service.hh
(forward declarations suffice), and remove unused reactor.hh from
tablets.hh. Add explicit includes in .cc files that lost transitive
availability.
Extract loaded_endpoint_state into a standalone lightweight header to
avoid pulling the heavy gossiper.hh (and transitively query-result-set.hh)
into every includer of system_keyspace.hh. Add explicit includes where
the full definitions are actually needed.
Reduces clean dev build time by ~2 minutes (-8%).
storage_proxy.hh included storage_service.hh but never referenced any
symbol from it. storage_service.hh costs 3.7s to parse per file, and
storage_proxy.hh has 75 direct includers. While most of those also
include database.hh (which shares transitive deps), removing this
unnecessary include still reduces total parse work.
Speedup: part of a series measured at -5.8% wall-clock improvement
(same-session A/B: 16m14s -> 15m17s at -j16, 16 cores).
Add explicit #include directives for headers that are currently
available transitively through cql3/query_processor.hh but will stop
being available after a subsequent refactoring that removes the
loading_cache include chain.
Files changed:
- cql3/statements/drop_keyspace_statement.cc: add unimplemented.hh
- cql3/statements/truncate_statement.cc: add unimplemented.hh
- cql3/statements/batch_statement.cc: add result_message.hh
- cql3/statements/broadcast_modification_statement.cc: add result_message.hh
- service/paxos/paxos_state.cc: add result_message.hh
- test/lib/cql_test_env.cc: add result_message.hh
- table_helper.cc: add result_message.hh
No functional change. Prepares for subsequent query_processor.hh cleanup.
implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.
* tablet merge simply merges the histograms of segments of one compaction group with another.
* for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups.
* for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range.
Refs SCYLLADB-770
no backport - still a new and experimental feature
Closesscylladb/scylladb#29207
* github.com:scylladb/scylladb:
test: logstor: additional logstor tests
docs/dev: add logstor on-disk format section
logstor: add version and crc to buffer header
test: logstor: tablet split/merge and migration
logstor: enable tablet balancing
logstor: streaming of logstor segments using stream_blob
logstor: add take_logstor_snapshot
logstor: segment input/output stream
logstor: implement compaction_group::cleanup
logstor: tablet split
logstor: tablet merge
logstor: add compaction reenabler
logstor: add segment header
logstor: serialize writes to active segment
replica: extend compaction_group functions for logstor
replica: add compaction_group_for_logstor_segment
logstor: code cleanup
Since we do no longer support upgrade from versions that do not support
v2 of "view building status" code (building status is managed by raft) we can remove v1 code and upgrade code and make sure we do not boot with old "builder status" version.
v2 version was introduced by 8d25a4d678 which is included in scylla-2025.1.0.
No backport needed since this is code removal.
Closesscylladb/scylladb#29105
* github.com:scylladb/scylladb:
view: drop unused v1 builder code
view: remove upgrade to raft code
Motivation
----------
Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.
Description of solution
-----------------------
The situations we focus on are the following:
* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns
We handle each of them and provide validation tests.
Implementation strategy
-----------------------
1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.
Tests
-----
We provide tests that validate the correctness of the solution.
The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):
Before:
```
real 0m31.809s
user 1m3.048s
sys 0m21.812s
```
After:
```
real 0m34.523s
user 1m10.307s
sys 0m27.223s
```
The incremental differences in time can be found in the commit messages.
Fixes SCYLLADB-429
Backport: not needed. This is an enhancement to an experimental feature.
Closesscylladb/scylladb#28526
* github.com:scylladb/scylladb:
service: strong_consistency: Abort state_machine::apply when aborting server
service: strong_consistency: Abort ongoing operations when shutting down
service: client_state: Extend with abort_source
service: strong_consistency: Handle abort when removing Raft group
service: strong_consistency: Abort Raft operations on timeout
service: strong_consistency: Use timeout when mutating
service: strong_consistency: Fix indentation
service: strong_consistency: Enclose coordinator methods with try-catch
service: strong_consistency: Crash at unexpected exception
test: cluster: Extract default config & cmdline in test_strong_consistency.py
The PR contains more code cleanups, mostly in gossiper. Dropping more gossiper state leaving only NORMAL and SHUTDOWN. All other states are checked against topology state. Those two are left because SHUTDOWN state is propagated through gossiper only and when the node is not in SHUTDOWN it should be in some other state.
No need to backport. Cleanups.
Closesscylladb/scylladb#29129
* https://github.com/scylladb/scylladb:
storage_service: cleanup unused code
storage_service: simplify get_peer_info_for_update
gossiper: send shutdown notifications in parallel
gms: remove unused code
virtual_tables: no need to call gossiper if we already know that the node is in shutdown
gossiper: print node state from raft topology in the logs
gossiper: use is_shutdown instead of code it manually
gossiper: mark endpoint_state(inet_address ip) constructor as explicit
gossiper: remove unused code
gossiper: drop last use of LEFT state and drop the state
gossiper: drop unused STATUS_BOOTSTRAPPING state
gossiper: rename is_dead_state to is_left since this is all that the function checks now.
gossiper: use raft topology state instead of gossiper one when checking node's state
storage_service: drop check_for_endpoint_collision function
storage_service: drop is_first_node function
gossiper: remove unused REMOVED_TOKEN state
gossiper: remove unused advertise_token_removed function
Add system.tablets to the set of system resources that can be
accessed with the VECTOR_SEARCH_INDEXING permission.
Fixes: VECTOR-605
Closesscylladb/scylladb#29397
The decommission sets left gossiper state only to prevent shutdown
notification be issued by the node during shutdown. Since the
notification code now checks the state in raft topology this is no
longer needed.
The state machine used by strongly consistent tablets may block on a
read barrier if the local schema is insufficient to resolve pending
mutations [1]. To deal with that, we perform a read barrier that may
block for a long time.
When a strongly consistent tablet is being removed, we'd like to cancel
all ongoing executions of `state_machine::apply`: the shard is no
longer responsible for the tablet, so it doesn't matter what the outcome
is.
---
In the implementation, we abort the operations by simply throwing
an exception from `state_machine::apply` and not doing anything.
That's a red flag considering that it may lead to the instance
being killed on the spot [2].
Fortunately for us, strongly consistent tables use the default Raft
server implementation, i.e. `raft::server_impl`, which actually
handles one type of an exception thrown by the method: namely,
`abort_requested_exception`, which is the default exception thrown
by `seastar::abort_source` [3]. We leverage this property.
---
Unfortunately, `raft::server_impl::abort` isn't perfectly suited for
us. If we look into its code, we'll see that the relevant portion of
the procedure boils down to three steps:
1. Prevent scheduling adding new entries.
2. Wait for the applier fiber.
3. Abort the state machine.
Since aborting the state machine happens only after the applier fiber
has already finished, there will no longer be anything to abort. Either
all executions of `state_machine::apply` have already finished, or they
are hanging and we cannot do anything.
That's a pre-existing problem that we won't be solving here (even
though it's possible). We hope the problem will be solved, and it seems
likely: the code suggests that the behavior is not intended. For more
details, see e.g. [4].
---
We provide two validation tests. They simulate the abortion of
`state_machine::apply` in two different scenarios:
* when the table is dropped (which should also cover the case of tablet
migration),
* when the node is shutting down.
The value of the tests isn't high since they don't ensure that the
state of the group is still valid (though it should be), nor do they
perform any other check. Instead, we rely on the testing framework to
spot any anomalies or errors. That's probably the best we can do at
the moment.
Unfortunately, both tests are marked as skipped becuause of the current
limitations of `raft::server_impl::abort` described above and in [4].
References:
[1] 4c8dba1
[2] See the description of `raft::state_machine` in `raft/raft.hh`.
[3] See `server_impl::applier_fiber` in `raft/server.cc`.
[4] SCYLLADB-1056
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.
When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.
We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).
---
Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:
```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
ss.local().drain_on_shutdown().get();
});
```
The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.
It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.
---
Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.
Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:
Before:
```
real 0m31.775s
user 1m4.475s
sys 0m22.615s
```
After:
```
real 0m32.024s
user 1m10.751s
sys 0m23.871s
```
Individual runs of the added tests:
test_queries_when_shutting_down:
```
real 0m12.818s
user 0m36.726s
sys 0m4.577s
```
test_abort_forwarded_write_upon_shutdown:
```
real 0m12.930s
user 0m36.622s
sys 0m4.752s
```
We make `client_state` store a pointer to an `abort_source`. This will
be useful in the following commit that will implement aborting ongoing
requests to strongly consistent tables upon connection shutdowns.
It might also be useful in some other places in the code in the future.
We set the abort source for client states in relevant places.
When a strongly consistent Raft group is being removed, it means one of
the following cases:
(A) The node is shutting down and it's simply part of the the shutdown
procedure.
(B) The tablet is somehow leaving the replica. For example, due to:
- Tablet migration
- Tablet split/merge
- Tablet removal (e.g. because the table is dropped)
In this commit, we focus on case (A). Case (B) will be handled in the
following one.
---
The changes in the code are literally none, and there's a reason to it.
First, let's note that we've already implemented abortion of timed-out
requests. There is a limit to how long a query can run and sooner or
later it will finish, regardless of what we do.
Second, we need to ask ourselves if the cases we're considering in this
commit (i.e. case (B)) is a situation where we'd like to speed up the
process. The answer is no.
Tablet migrations are effectively internal operations that are invisible
to the users. User requests are, quite obviously, the opposite of that.
Because of that, we want to patiently wait for the queries to finish or
time out, even though it's technically possible to lead to an abort
earlier.
Lastly, the changes in the code that actually appear in this commit are
not completely irrelevant either. We consider the important case of
the `leader_info_updater` fiber and argue that it's safe to not pass
any abort source to the Raft methods used by it.
---
Unfortunately, we don't have tablet migrations implemented yet [1],
so our testing capabilities are limited. Still, we provide a new test
that corresponds to case (B) described above. We simulate a tablet
migration by dropping a table and observe how reads and writes behave
in such a situation. There's no extremely careful validation involved
there, but that's what we can have for the time being.
Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:
Before:
```
real 0m30.841s
user 1m3.294s
sys 0m21.091s
```
After:
```
real 0m31.775s
user 1m4.475s
sys 0m22.615s
```
The time spent on the new test only:
```
real 0m5.264s
user 0m34.646s
sys 0m3.374s
```
References:
[1] SCYLLADB-868
If a query, either a write, or a read to a strongly consistent table,
times out, we immediately abort the operation and throw an exception.
Unfortunately, due to the inconsistency in exception types thrown
on timeout by the many methods we use in the code, it results in
pretty messy `try-catch` clauses. Perhaps there's a better alternative
to this, but it's beyond the scope of this work, so we leave it as-is.
We provide a validation test that consists of three cases corresponding
to reads, writes, and waiting for the leader. They verify that the code
works as expected in all affected places.
A comparison of time spent on the whole `test_strong_consistency.py` on
my local machine, in dev mode:
Before:
```
real 0m32.185s
user 0m55.391s
sys 0m15.745s
```
After:
```
real 0m30.841s
user 1m3.294s
sys 0m21.091s
```
The time spent on the new test only:
```
real 0m7.077s
user 0m35.359s
sys 0m3.717s
```
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.
Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
We enclose `coordinator::{mutate,query}` with `try-catch` clauses. They
do nothing at the moment, but we'll use them later. We do this now to
avoid noise in the upcoming commits.
We'll fix the indentation in the following commit.
The loop shouldn't throw any other exception than the ones already
covered by the `catch` claues. Crash, at least when
`abort_on_internal_error` is set, if we catch any other type since
that may be a sign of a bug.
While working on benchmarks for strong consistency we noticed that the raft logic attempted to take snapshots during the benchmark. Snapshot transfer is not implemented for strong consistency yet and the methods that take or transfer snapshots throw exceptions. This causes the raft groups to stop working completely.
While implementing snapshot transfers is out of scope, we can implement some mitigations now to stop the tests from breaking:
- The first commit adjusts the configuration options. First, it disables periodic snapshotting (i.e. creating a snapshot every X log entries). Second, it increases the memory threshold for the raft log before which a snapshot is created from 2MB to 10MB.
- The second commit relaxes the take snapshot / drop snapshot methods and makes it possible to actually use them - they are no-ops. It is still forbidden to transfer snapshots.
I am including both commits because applying only the first one didn't completely prevent the issue from occurring when testing locally.
Refs: SCYLLADB-1115
Strong consistency is experimental, no need for backport.
Closesscylladb/scylladb#29189
* github.com:scylladb/scylladb:
strong_consistency: fake taking and dropping snapshots
strong_consistency: adjust limits for snapshots
Currently we don't support 'local' consistency, which would
imply maintaining separate raft group for each dc. What we
support is actually 'global' consistency -- one raft group
per tablet replica set. We don't plan to support local
consistency for the first GA.
Closesscylladb/scylladb#29221
The endpoint in question has some places worth fixing, in particular
- the keyspace parameter is not validated
- the validated table name is resolved into table_id, but the id is unused
- two ugly static helpers to stream obtained token ranges into json
Improving the API code flow, not backporting
Closesscylladb/scylladb#29154
* github.com:scylladb/scylladb:
api: Inline describe_ring JSON handling
storage_service: Make describe_ring_for_table() take table_id
The vector-search feature introduced the somewhat confusing feature of
enabling CDC without explicitly enabling CDC: When a vector index is
enabled on a table, CDC is "enabled" for it even if the user didn't
ask to enable CDC.
For this, write-path code began to use a new cdc_enabled() function
instead of checking schema.cdc_options.enabled() directly. This
cdc_enabled() function checks if either this enabled() is true, or
has_vector_index() is true.
Unfortunately, LWT writes continued to use cdc_options.enabled() instead
of the new cdc_enabled(). This means that if a vector index is used and
a vector is written using an LWT write, the new value is not indexed.
This patch fixes this bug. It also adds a regression test that fails
before this patch and passes afterwards - the new test verifies that
when a table has a vector index (but no explicit CDC enabled), the CDC
log is updated both after regular writes and after successful LWT writes.
This patch was also tested in the context of the upcoming vector-search-
for-Alternator pull request, which has a test reproducing this bug
(Alternator uses LWT frequently, so this is very important there).
It will also be tested by the vector-store test suite ("validator").
Fixes SCYLLADB-1342
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29300
When a table has a vector index, cdc::cdc_enabled() returns true because
vector index writes are implemented via the CDC augmentation path. However,
register_cdc_operation_result_tracker() was checking only
cdc_options().enabled(), which is false for tables that have a vector index
but not traditional CDC.
As a result, the operation_result_tracker was never attached to write
response handlers for vector-indexed tables. This tracker was added in
commit 1b92cbe, and its job is to update metrics of CDC operations,
and since vector search really does use CDC under the hood, these
metrics could be useful when diagnosing problems.
Fix by using cdc::cdc_enabled() instead of cdc_options().enabled(), which
covers both traditional CDC and vector-indexed tables.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29343
This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets.
The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following:
1. Create tablet maps for all tables in the keyspace.
2. Sequentially upgrade all nodes from vnodes to tablets:
1. Mark a node for upgrade in the topology state.
2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM.
3. Wait for the node to return online before proceeding to the next node.
4. Finalize the migration:
1. Update the keyspace schema to mark it as tablet-based.
2. Clear the group0 state related to the migration.
From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded.
During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization.
The patch series consists of:
* Load balancer adjustments to ignore tablets belonging to a migrating keyspace.
* A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder.
* A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction.
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request.
* Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor.
* Cluster tests for the migration process.
Fixes SCYLLADB-722.
Fixes SCYLLADB-723.
Fixes SCYLLADB-725.
Fixes SCYLLADB-779.
Fixes SCYLLADB-948.
New feature, no backport is needed.
Closesscylladb/scylladb#29065
* github.com:scylladb/scylladb:
docs: Add ops guide for vnodes-to-tablets migration
test: cluster: Add test for migration of multiple keyspaces
test: cluster: Add test for error conditions
test: cluster: Add vnodes->tablets migration test (rollback)
test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
scylla-nodetool: Add migrate-to-tablets subcommand
api: Add REST endpoint for vnode-to-tablet migration status
api: Add REST endpoint for migration finalization
topology_coordinator: Add `finalize_migration` request
database: Construct migrating tables with tablet ERMs
api: Add REST endpoint for upgrading nodes to tablets
api: Add REST endpoint for starting vnodes-to-tablets migration
topology_state_machine: Add intended_storage_mode to system.topology
distributed_loader: Wire vnode-based resharding into table populator
replica: Pick any compaction group for resharding
compaction: resharding_compaction: add vnodes_resharding option
storage_service: Preserve ERM flavor of migrating tables
tablet_allocator: Exclude migrating tables from load balancing
feature_service: Add vnodes_to_tablets_migrations feature
implement tablet migration for logstor tables by streaming segments
using stream_blob, similar to file streaming of sstables.
take a snapshot of the logstor segments and create a stream_blob_info
vector with entry for each segment with the input stream that reads the
segment and an op of type file_ops::stream_logstor_segments.
the stream_blob_handler creates a logstor sink that allocates a segment
on the target shard and creates an output stream that writes to it. when
the sink is closed it loads the segment.
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.
Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.
We also provide a regression test. It takes 6s in dev mode.
Fixes SCYLLADB-959
Closesscylladb/scylladb#29266
If the keyspace is migrating, it reports the intended and actual storage
mode for each node.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization
When called, it issues a `finalize_migration` topology request and waits
for its completion.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Vnodes-to-tablets migration needs a finalization step to finish or
rollback the migration. Finishing the migration involves switching the
keyspace schema to tablets and clearing the `intended_storage_mode` from
system.topology. Rolling back the migration involves deleting the tablet
maps and clearing the `intended_storage_mode`.
The finalization needs to be done as a topology request to exclude with
other operations such as repair and TRUNCATE.
This patch introduces the `finalize_migration` global topology request
for this purpose. The request takes a keyspace name as an argument.
The direction of the finalization (i.e., forward path vs rollback) is
inferred from the `intended_storage_mode` of all nodes (not ideal,
should be made explicit).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes}
This endpoint is part of the vnodes-to-tablets migration process and
controls a node's intended_storage_mode in system.topology. The storage
mode represents the node-local data distribution model, i.e., how data
are organized across shards. The node will apply the intended storage
mode to migrating tables upon next restart by resharding their SSTables
(either on vnode boundaries if intended_mode=tablets, or with the static
sharder if intended_mode=vnodes).
Note that this endpoint controls the intended_storage_mode of the local
node only. This has the nice benefit that once the API call returns, the
change has not only been committed to group0 but also applied to the
local node's state machine. This guarantees that the change is part of
the node's local copy upon next restart; no additional read barrier is
needed.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}
Its purpose is to start the migration of a whole keyspace from vnodes to
tablets.
When called, Scylla will synchronously create a tablet map for each
table in the specified keyspace. The tablet maps of all tables are
identical and they mirror the vnode layout; they contain one tablet per
vnode and each tablet uses the same replica hosts and token boundaries
as the corresponding vnode.
The only difference from vnodes lies in the sharding approach. Tablets
are assigned to a single shard - using a round-robin strategy in this
patch - whereas vnodes are distributed evenly across all shards. If the
tablet count per shard is low and tablet sizes are uneven, or some
shards have more tablets than others, performance may degrade during the
migration process. For example, a cluster with i8g.48xlarge (192 vCPUs),
256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet
replicas per shard during the migration. One additional tablet or a
double-sized tablet would cause 25% overcommit.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Part of the vnodes-to-tablets migration is to reshard the SSTables of
each node on vnode boundaries. Resharding is a heavy operation that
runs on startup while the node is offline. Since nodes can restart
for unexpected reasons, we need a flag to do it in a controllable way.
We also need the ability to roll back the migration, which requires
resharding in the opposite direction. This means a node must be aware of
the intended migration direction.
To address both requirements, this patch introduces a new column,
intended_storage_mode, in system.topology. A non-null value indicates
that a node should perform a migration and specifies the migration
direction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
When a table is migrating from vnodes to tablets, the cluster is in a
mixed state where some nodes use vnode ERMs and others use tablet ERMs.
The ERM flavor is a node-local property that expresses the node's
storage organization.
Preserve the flavor across token metadata changes. The flavor needs to
be on par with storage, but the storage can change only on startup, as
it requires resharding all SSTables to conform with the flavor.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.
With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m` the commit kept the
old variable name of `mut` which in the new code corresponds to "old"
non-split mutation
Fixes#29051
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29052
Snapshots are not implemented yet for strong consistency - attempting to
take, transfer or drop a snapshot results in an exception. However, the
logic of our state machine forces snapshot transfer even if there are no
lagging replicas - every raft::server::configuration::snapshot_threshold
log entries. We have actually encountered an issue in our benchmarks
where snapshots were being taken even though the cluster was not under
any disruption, and this is one of the possible causes.
It turns out that we can safely allow for taking snapshots right now -
we can just implement it as a no-op and return a random UUID.
Conversely, dropping a snapshot can also be a no-op. This is safe
because snapshot transfer still throws an exception - as long as the
taken/recovered snapshots are never attempted to be transferred.
Raft snapshots are not implemented yet for strong consistency. Adjust
the current raft group config to make them much less likely to occur:
- snapshot_threshold config option decides how many log entries need to
be applied after the last snapshot. Set it to the maximum value for
size_t in order to effectively disable it.
- snapshot_threshold_log_size defines a threshold for the log memory
usage over which a snapshot is created. Increase it from the default
2MB to 10MB.
- max_log_size defines the threshold for the log memory usage over which
requests are stopped to be admitted until the log is shrunk back by a
snapshot. Set it to 20MB, as this option is recommended to be at least
twice as much as snapshot_threshold_log_size.
Refs: SCYLLADB-1115
All callers already have it. It makes no difference for the method
itself with which table identifier to work, but will help to simplify
the flow in API handler (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.
Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.
Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug
Closesscylladb/scylladb#28791
* github.com:scylladb/scylladb:
auth: switch query_attribute_for_all to use cache
auth: switch get_attribute to use cache
auth: cache: add heterogeneous map lookups
auth: switch query_all to use cache
auth: switch query_all_directly_granted to use cache
auth: cache: add ability to go over all roles
raft: service: reload auth cache before service levels
service: raft: move update_service_levels_effective_cache check
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
Keep token_metadata_ptr in get_children to prevent topology from changing.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867
Needs backports to all versions
Closesscylladb/scylladb#29035
* github.com:scylladb/scylladb:
tasks: fix indentation
tasks: do not fail the wait request if rpc fails
tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
Since we do no longer support upgrade from versions that do not support
v2 of view building code we can remove upgrade code and make sure we do
not boot with old builder version.