Similarly to other API handlers, instead of using a database from http
context, patch the setting methods to capture the database from main
code and pass it around to handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).
However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.
This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.
The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.
Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.
Refs #25115Fixes#24625
Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.
Closesscylladb/scylladb#25151
* https://github.com/scylladb/scylladb:
raft_group0: split shutdown into abort_and_drain and destroy
Revert "main.cc: fix group0 shutdown order"
Nowadays the way to configure an internal service is
1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value
The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.
For the reference: similar changes with other services: #23705 , #20174 , #19166Closesscylladb/scylladb#25118
* github.com:scylladb/scylladb:
gms,init: Move get_disabled_features_from_db_config() from gms
code: Update callers generating feature service config
gms: Make feature_config a simple struct
gms: Split feature_config_from_db_config() into two
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).
However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.
This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.
The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.
Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().
Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().
The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
CPU scheduling has been with us since 641aaba12c
(2017), and no one ever disables it. Likely nothing really works without
it.
Make it mandatory and mark the option unused.
Closesscylladb/scylladb#24894
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.
This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.
Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.
Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.
Fixes https://github.com/scylladb/scylladb/issues/24524
Backport not needed, as it is a new feature.
Closesscylladb/scylladb#24924
* github.com:scylladb/scylladb:
main: utils: add thread names to alien workers
auth: move passwords::check call to alien thread
test: wait for 3 clients with given username in test_service_level_api
auth: refactor password checking in password_authenticator
auth: make SHA-512 the only password hashing scheme for new passwords
auth: whitespace change in identify_best_supported_scheme()
auth: require scheme as parameter for `generate_salt`
auth: check password hashing scheme support on authenticator start
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
Analysis of customer stalls showed that the `detail::hash_with_salt`
function, called from `passwords::check`, often blocks the reactor.
This function internally uses the `crypt_r` function from an external
library to compute password hashes, which is a CPU-intensive operation.
To prevent such reactor stalls, this commit moves the
`passwords::check` call to a dedicated alien thread. This thread is
created at system startup and is shared by all shards.
Within the alien thread, an `std::mutex` synchronizes access between
the thread and the shards. While this could theoretically cause
frequent lock contentions, in practice, even during connection storms,
the number of new connections per second per shard is limited
(typically hundreds per second). Additionally, the
`_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not
too many connections are processed at once.
Fixesscylladb/scylladb#24524
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.
Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531Closesscylladb/scylladb#24886
[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]
* github.com:scylladb/scylladb:
test: add type creation to test_snapshot
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)
- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.
It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.
For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.
This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.
Reference: VS-47 VS-45
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is <host, shard> pair) processes two tablets at a time.
The new algorithm can be summarized as follows:
1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space.
2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each
artition range (in such cases, the partition range is assigned to the appropriate tablet_replica).
In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.
In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.
In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.
Additionally, shard_id is added to mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensure that the balance is preserved, even during events such as tablet splits.
This patch series:
- Refactors a bit mapreduce service, to facilitate having two algorithm versions (one for vnodes and one for tablets).
- Implements tablet-aware dispatching algorithm.
- Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
- Adds test_long_query_timeout_erm to verify the new functionality.
Fixes: scylladb#21831
No backport, as it is rather new feature than a bugfix.
Closesscylladb/scylladb#24383
* github.com:scylladb/scylladb:
mapreduce: add missing comma and space in mapreduce_request operator<<
mapreduce: add shard_id_hint to mapreduce request
test: add test_long_query_timeout_erm
mapreduce: add tablet-aware dispatching algorithm
storage_proxy: make storage_proxy::is_alive public
mapreduce: remove _shared_token_metadata from mapreduce_service
mapreduce: move dispatching logic to dispatch_to_vnodes
mapreduce: remove underscores from variable names
mapreduce: move req_with_modified_pr handling to a new function
mapreduce: change next_vnode lambda to get_next_partition_range function
Before this change, `mapreduce_service` used `_shared_token_metadata`
to get the topology. However, the token was used in a part of the code
that already had its own ERM with its own metadata token. Moreover,
as mapreduce_service's token and ERM's token are not guaranteed to be
the same, inconsistencies could occur.
Therefore, this commit removes `_shared_token_metadata` and its usage.
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).
When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).
Fixes#24513
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)
- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at a time.
Move stream_manager initialization over the storage_service initialization.
Fixes: #23207.
Closesscylladb/scylladb#24008
compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.
Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.
There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.
Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.
There is another issue with putting all dicts on shard 0: it eats
an assymetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).
It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.
If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.
So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
New functionality, added to a feature which isn't in any stable branch yet. No backporting.
Closesscylladb/scylladb#23590
* github.com:scylladb/scylladb:
test: add test/boost/sstable_compressor_factory_test
compress: add some test-only APIs
compress: rename sstable_compressor_factory_impl to dictionary_holder
compress: fix indentation
compress: remove sstable_compressor_factory_impl::_owner_shard
compress: distribute compression dictionaries over shards
test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
test: remove sstables::test_env::do_with()
test.py doesn't override stdin when starting Scylla, so when
tests are run from a terminal, isatty() returns true and
parsed command line output is not printed, which is inconvenient.
In this commit we add a check if the current process group
controls the stdin terminal. This serves two purposes:
* improves the "interactive mode" check from #scylladb/scylladb#18309,
as only the controlling process group can interact with the terminal.
* solves the test.py problem above, because test.py runs scylla in a new
session/process group (it calls setsid after fork), and is now
correctly not considered interactive.
Closesscylladb/scylladb#24047
We don't want each shard to have its own copy of each dictionary.
It would unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.
Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.
There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.
Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.
There is another issue with putting all dicts on shard 0: it eats
an assymetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).
It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.
If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.
So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.
Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.
Fixes: scylladb/scylladb#23278Fixes: scylladb/scylladb#22869
Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga
Closesscylladb/scylladb#23800
* github.com:scylladb/scylladb:
doc: changing topology when changing snitches is no longer supported
test: cluster: introduce test_no_dc_rack_change
storage_service: don't update DC/rack in update_topology_with_local_metadata
main: make dc and rack immutable after bootstrap
test: cluster: remove test_snitch_change
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.
Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.
Fixes: scylladb/scylladb#23278
Can be used to query per-node stats about load as seen by the load
balancer.
In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
Move `object_storage.yaml` endpoints to `scylla.yaml`
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.
Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.
This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f
Refs https://github.com/scylladb/scylladb/issues/22428Closesscylladb/scylladb#22952
* github.com:scylladb/scylladb:
Remove db::config::object_storage_config
Move `object_storage.yaml` endpoints to `scylla.yaml`
When a table is dropped, its corresponding dictionary in `system.dicts`
-- if any -- should be deleted, otherwise it will remain forever as
garbage.
This commit implements such cleanup.
Add an API call which will retrain the SSTable compression dictionary
for a given table.
Currently, it needs all nodes to be alive to succeed. We can relax this later.
Currently, there is at most one dictionary in `system.dicts`:
named "general", used by RPC compression. So the callback called
on `system.dicts` just always refreshes the RPC compression dict.
In a follow-up commit, we will publish SSTable compression dicts to
`system.dicts` rows with a name in the "sstables/{table_uuid}" format.
We want modification to such rows to be passed as new dictionary
recommendations to the SSTable compressor factory. This commit teaches
the `system.dicts` modification callback to recognize such modifications
and forward them to the compressor factory.
Extend the `system.dicts` helper for querying and modifying
`system.dicts` with an ability to use names other than "general".
We will use that in later commits to publish dictionaries for SSTable compression.
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".
In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.
To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)
For now, only the `system.dicts` callback uses the partition keys.
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.
In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.
Fixes#23222
* requires backport on top of https://github.com/scylladb/scylladb/pull/23184Closesscylladb/scylladb#23306
* github.com:scylladb/scylladb:
main: allow abort during join_cluster
main: add checkpoint before joining cluster
storage_service: add start_sys_dist_ks
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.
The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
keyspaces are RF-rack-valid. If not, the initialization fails.
We provide tests verifying that the changes behave as intended.
---
Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.
---
Implementation strategy (going beyond the scope of this PR):
1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.
---
Fixesscylladb/scylladb#20356Fixesscylladb/scylladb#23276Fixesscylladb/scylladb#23300
---
Backport: this is part of the requirements for releasing 2025.1.
Closesscylladb/scylladb#23138
* github.com:scylladb/scylladb:
main: Refuse to start node when RF-rack-invalid keyspace exists
cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
db/config: Introduce RF-rack-valid keyspaces
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.
Fixesscylladb/scylladb#23300
This PR introduces the new Raft-based recovery procedure for group 0
majority loss.
The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.
The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.
We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.
The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.
For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.
The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.
Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from `system.raft`,
`system.raft_snaphots`, and `system.raft_snapshot_config`.
In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.
This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.
Fixesscylladb/scylladb#20657Closesscylladb/scylladb#22286
* github.com:scylladb/scylladb:
test: mark tests with the gossip-based recovery procedure
test: add tests for the Raft-based recovery procedure
test: topology: util: fix the tokens consistency check for left nodes
test: topology: util: extend start_writes
gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
raft_group0: modify_raft_voter_status: do not add new members
treewide: allow recreating group 0 in the Raft-based recovery procedure
Refs #23017
When replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we
never release segments lower than limit). This is wasteful with diskspace.
But deleting segments early here is also wasteful; A better solution is
to simply give the segments to all CL shards, thus distributing the available
space.
v2:
* Do segement distribution using ranges. go c++23
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.
Fixes#23222
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>