Commit Graph

599 Commits

Author SHA1 Message Date
Avi Kivity
66173c06a3 Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.

Fixes #24248

* Enhancement, no backport needed

Closes scylladb/scylladb#24512

* github.com:scylladb/scylladb:
  streaming: stream_blob: use the table sstable_generation_generator
  replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
  sstables: sstable_generation_generator: stop tracking highest generation
  replica: table: get rid of update_sstables_known_generation
  sstables: sstable_directory: stop tracking highest_generation
  replica: distributed_loader: stop tracking highest_generation
  sstables: sstable_generation: get rid of uuid_identifiers bool class
  sstables_manager: drop uuid_sstable_identifiers
  feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
  test: cql_query_test: add test_sstable_load_mixed_generation_type
  test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
  test: database_test: move table_dir helper to test/lib/test_utils
2025-08-14 11:54:33 +03:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Benny Halevy
7c9ce235d7 test: database_test: move table_dir helper to test/lib/test_utils
It's a generic helper that can be used by all tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-07 12:04:23 +03:00
Patryk Jędrzejczak
3299ffba51 Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.

The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.

Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.

Refs #25115

Fixes #24625

Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.

Closes scylladb/scylladb#25151

* https://github.com/scylladb/scylladb:
  raft_group0: split shutdown into abort_and_drain and destroy
  Revert "main.cc: fix group0 shutdown order"
2025-07-29 10:39:00 +02:00
Botond Dénes
f3ed27bd9e Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov
Nowadays the way to configure an internal service is

1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value

The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.

For the reference: similar changes with other services: #23705 , #20174 , #19166

Closes scylladb/scylladb#25118

* github.com:scylladb/scylladb:
  gms,init: Move get_disabled_features_from_db_config() from gms
  code: Update callers generating feature service config
  gms: Make feature_config a simple struct
  gms: Split feature_config_from_db_config() into two
2025-07-29 08:17:49 +03:00
Petr Gusev
8b8b7adbe5 raft_group0: split shutdown into abort_and_drain and destroy
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).

However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.

This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.

The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.

Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().

Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().

The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
2025-07-25 17:16:14 +02:00
Petr Gusev
ac4bc3f816 paxos_state: lazily create paxos state table
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00
Pavel Emelyanov
8220974e76 code: Update callers generating feature service config
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:19:09 +03:00
Avi Kivity
c762425ea7 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start
2025-07-16 13:15:54 +03:00
Andrzej Jackowski
77a9b5919b main: utils: add thread names to alien workers
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
2025-07-15 23:29:21 +02:00
Andrzej Jackowski
9574513ec1 auth: move passwords::check call to alien thread
Analysis of customer stalls showed that the `detail::hash_with_salt`
function, called from `passwords::check`, often blocks the reactor.
This function internally uses the `crypt_r` function from an external
library to compute password hashes, which is a CPU-intensive operation.

To prevent such reactor stalls, this commit moves the
`passwords::check` call to a dedicated alien thread. This thread is
created at system startup and is shared by all shards.

Within the alien thread, an `std::mutex` synchronizes access between
the thread and the shards. While this could theoretically cause
frequent lock contentions, in practice, even during connection storms,
the number of new connections per second per shard is limited
(typically hundreds per second). Additionally, the
`_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not
too many connections are processed at once.

Fixes scylladb/scylladb#24524
2025-07-15 23:29:13 +02:00
Avi Kivity
6fce817aa8 Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531

Closes scylladb/scylladb#24886

[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]

* github.com:scylladb/scylladb:
  test: add type creation to test_snapshot
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-07-13 20:47:55 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Marcin Maliszkiewicz
fa157e7e46 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-07-10 10:40:43 +02:00
Pawel Pery
7bf53fc908 vector_store_client: implement initial vector_store_client service
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.

For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.

This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.

Reference: VS-47 VS-45
2025-07-08 16:29:55 +02:00
Andrzej Jackowski
9dbb1468b4 mapreduce: remove _shared_token_metadata from mapreduce_service
Before this change, `mapreduce_service` used `_shared_token_metadata`
to get the topology. However, the token was used in a part of the code
that already had its own ERM with its own metadata token. Moreover,
as mapreduce_service's token and ERM's token are not guaranteed to be
the same, inconsistencies could occur.

Therefore, this commit removes `_shared_token_metadata` and its usage.
2025-06-25 08:42:16 +02:00
Dawid Mędrek
c60035cbf6 test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
We've adjusted all of the Boost tests so they respect the invariant
enforced by the `rf_rack_valid_keyspaces` configuration option, or
explicitly disabled the option in those that turned out to be more
problematic and will require more attention. Thanks to that, we can
now enable it by default in the test suite.
2025-05-27 18:53:39 +02:00
Aleksandra Martyniuk
9c03255fd2 cql_test_env: main: move stream_manager initialization
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at a time.

Move stream_manager initialization over the storage_service initialization.

Fixes: #23207.

Closes scylladb/scylladb#24008
2025-05-15 17:17:35 +03:00
Michał Chojnowski
1bcf77951c compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an assymetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
2025-05-07 14:43:18 +02:00
Pavel Emelyanov
eb5b52f598 Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
Fixes: scylladb/scylladb#22869

Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga

Closes scylladb/scylladb#23800

* github.com:scylladb/scylladb:
  doc: changing topology when changing snitches is no longer supported
  test: cluster: introduce test_no_dc_rack_change
  storage_service: don't update DC/rack in update_topology_with_local_metadata
  main: make dc and rack immutable after bootstrap
  test: cluster: remove test_snitch_change
2025-04-21 15:52:55 +03:00
Piotr Dulikowski
ce2fab7cce main: make dc and rack immutable after bootstrap
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
2025-04-17 16:22:26 +02:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
2025-04-09 20:21:51 +02:00
Michał Chojnowski
b77c611c00 raft/group0_state_machine: on system.dicts mutations, pass the affected partitition keys to the callback
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".

In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.

To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)

For now, only the `system.dicts` callback uses the partition keys.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
30a9d471fa sstables: plug an sstable_compressor_factory into sstables_manager
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.

In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
2025-04-01 00:07:28 +02:00
Dawid Mędrek
0e04a6f3eb main: Refuse to start node when RF-rack-invalid keyspace exists
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.

Fixes scylladb/scylladb#23300
2025-03-19 15:13:44 +01:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This is series starts conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. On of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new node can co-exist in the same
cluster,
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
f0af3f261e messaging_service: add temporary address map entry on incoming connection
We want to move to use host ids as soon as possible. Currently it is
possible only after the full gossiper exchange (because only at this
point gossiper state is added and with it address map entry). To make it
possible to move to host ids earlier this patch adds address map entries
on incoming communication during CLIENT_ID verb processing. The patch
also adds generation to CLIENT_ID to use it when address map is updated.
It is done so that older gossiper entries can be overwritten with newer
mapping in case of IP change.
2025-03-11 12:09:21 +02:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Avi Kivity
28906c9261 Merge 'scylla-sstable: introduce the query command' from Botond Dénes
The scylla-sstable dump-* command suite has proven invaluable  in many investigations. In certain cases however, I found that `dump-data` is quite cumbersome. An example would be trying to find certain values in an sstable, or trying to read the content of system tables when a node is down. For these cases, `dump-data`  is very cumbersome: one has to trudge through tons of uninteresting metadata and do compaction in their heads. This PR introduces the new scylla-sstable query command, specifically targeted at situations like this: it allows executing queries on sstables, exposing to the user all the power of CQL, to tailor the output as they see fit.

Select everything from a table:

    $ scylla sstable query --system-schema /path/to/data/system_schema/keyspaces-*/*-big-Data.db
     keyspace_name                 | durable_writes | replication
    -------------------------------+----------------+-------------------------------------------------------------------------------------
            system_replicated_keys |           true |                         ({class : org.apache.cassandra.locator.EverywhereStrategy})
                       system_auth |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 1})
                     system_schema |           true |                              ({class : org.apache.cassandra.locator.LocalStrategy})
                system_distributed |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 3})
                            system |           true |                              ({class : org.apache.cassandra.locator.LocalStrategy})
                                ks |           true | ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})
                     system_traces |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 2})
     system_distributed_everywhere |           true |                         ({class : org.apache.cassandra.locator.EverywhereStrategy})

Select everything from a single SSTable, use the JSON output (filtered through [jq](https://jqlang.github.io/jq/) for better readability):

    $ scylla sstable query --system-schema --output-format=json /path/to/data/system_schema/keyspaces-*/me-3gm7_127s_3ndxs28xt4llzxwqz6-big-Data.db | jq
    [
      {
        "keyspace_name": "system_schema",
        "durable_writes": true,
        "replication": {
          "class": "org.apache.cassandra.locator.LocalStrategy"
        }
      },
      {
        "keyspace_name": "system",
        "durable_writes": true,
        "replication": {
          "class": "org.apache.cassandra.locator.LocalStrategy"
        }
      }
    ]

Select a specific field in a specific partition using the command-line:

    $ scylla sstable query --system-schema --query "select replication from scylla_sstable.keyspaces where keyspace_name='ks'" ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
     replication
    -------------------------------------------------------------------------------------
     ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})

Select a specific field in a specific partition using ``--query-file``:

    $ echo "SELECT replication FROM scylla_sstable.keyspaces WHERE keyspace_name='ks';" > query.cql
    $ scylla sstable query --system-schema --query-file=./query.cql ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
     replication
    -------------------------------------------------------------------------------------
     ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})

New functionality: no backport needed.

Closes scylladb/scylladb#22007

* github.com:scylladb/scylladb:
  docs/operating-scylla: document scylla-sstable query
  test/cqlpy/test_tools.py: add tests for scylla-sstable query
  test/cqlpy/test_tools.py: make scylla_sstable() return table name also
  scylla-sstable: introduce the query command
  tools/utils: get_selected_operation(): use std::string for operation_options
  utils/rjson: streaming_writer: add RawValue()
  cql3/type_json: add to_json_type()
  test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()
2025-03-06 13:42:45 +02:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
New RPC is introduced becuase load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Botond Dénes
5d63ef4d15 Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund
Fixes #22314

Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.

Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.

Closes scylladb/scylladb#22327

* github.com:scylladb/scylladb:
  tools: Add standard extensions and propagate to schema load
  cql_test_env: Use add all extensions instead of inidividually
  main: Move extensions adding to function
  tomstone_gc: Make validate work for tools
2025-02-26 13:52:47 +02:00
Tomasz Grabiec
f3b63bfeff test: cql_test_env: Expose db config 2025-02-19 16:29:08 +01:00
Botond Dénes
01a4d30d88 test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()
This variant of do_with_cql_env(), forgoes the reentrancy support in the
regular do_with_cql_env() variants, and re-uses the caller's exsting
seastar thread. This is an optimized version for callers which don't
need reentrancy and already have a thread.
2025-02-17 08:01:38 -05:00
Pavel Emelyanov
5d1f74b86a main: Start sharded<view_builder> earlier
The view_builder service is needed by repair service, but is started
after it. It's OK in a sense that repair service holds a sharded
reference on it and checks whether local_is_initialized() before using
it, which is not nice.

Fortunately, starting sharded view buidler can be done early enough,
because most of its dependencies would be already started by that time.
Two exceptions are -- view_update_generator and
system_distributed_keyspace. Both can be moved up too with the same
justification.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-14 20:26:55 +03:00
Pavel Emelyanov
f650e75137 test/cql_env: Move stream manager start lower
This is to keep it in-sync with main code, where stream manager is
started after storage_proxy's and query_processor's remotes. This
doesn't change nothing for now, but next patches will move other
services around main/cql_test_env and early start of stream manager in
cql_test_env will be problematic.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-14 20:25:20 +03:00
Botond Dénes
4a7a75dfcb Merge 'tasks: use host_id in task manager' from Aleksandra Martyniuk
Use host_id in a children list of a task in task manager to indicate
a node on which the child was created.

Move TASKS_CHILDREN_REQUEST to IDL. Send it by host_id.

Fixes: https://github.com/scylladb/scylladb/issues/22284.

Ip to host_id transition; backport isn't needed.

Closes scylladb/scylladb#22487

* github.com:scylladb/scylladb:
  tasks: drop task_manager::config::broadcast_address as it's unused
  tasks: replace ip with host_id in task_identity
  api: task_manager: pass gossiper to api::set_task_manager
  tasks: keep host_id in task_manager
  tasks: move tasks_get_children to IDL
2025-02-11 11:32:27 +02:00
Tomasz Grabiec
3bb9d2fbdb test: cql_test_env: Expose topology_state_machine 2025-02-07 16:09:21 +01:00
Aleksandra Martyniuk
4470c2f6d3 tasks: keep host_id in task_manager
Keep host_id of a node in task manager. If host_id wasn't resolved
yet, task manager will keep an empty id.

It's a preparation for the following changes.
2025-02-05 10:10:29 +01:00
Kamil Braun
febd45861e test/lib: cql_test_env: make service shutdown more verbose
Introduce `defer_verbose_shutdown` in `cql_test_env` which logs
a message before and after shutting down a service, distinguishing
between success and failure.

The function is similar to the one in `main` but skips special error
handling logic applicable only to the main Scylla binary. The purpose
of the `cql_test_env` version of this function is only more verbose
logging. If necessary it can be extended in the future with additional
logic.

I estimated the impact on the size of produced log files using
`cdc_test` as an example:
```
$ build/dev/test/boost/combined_tests --run_test=cdc_test -- --smp=2 \
    >logfile 2>&1
$ du -b logfile
```

the result before this commit: 1964064 bytes, after: 2196432 bytes,
so estimated ~12% increase of log file size for boost tests that use
`cql_test_env`, assuming that the number of logs printed by each test is
similar to the logs printed by `cdc_test` (but I believe `cdc_test` is
one of the less verbose tests so this is an overestimate).

The motivation for this change is easier debugging of shutdown issues.
When investigating scylladb/scylladb#21983, where an exception is
thrown somewhere during the shutdown procedure, I found it hard to
pinpoint the service from which the exception originates. This change
will make it easier to debug issues like that by wrapping shutdown of
each service in a pair of messages logged when shutdown starts and when
it finishes (including when it fails). We should get more details on
this issue when it reproduces again in CI after this commit is merged
into `master`. (I failed to reproduce it locally with 1000 runs.)

Ref scylladb/scylladb#21983

Closes scylladb/scylladb#22566
2025-01-30 10:27:45 +03:00
Pavel Emelyanov
ca9b59f3b2 storage_service: Drop sys_dist_ks argument from join_cluster()
Storage service has _sys_dist_ks onboard and can just use it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-01-24 12:26:32 +03:00
Piotr Dulikowski
6aa962f5f4 Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:

Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management

Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup

I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.

Notably, there should be no non-whitespace changes between this and scylla-enterprise

Fixes scylladb/scylla-enterprise#4999

Closes scylladb/scylladb#22147

* github.com:scylladb/scylladb:
  audit: Add shares support to service level management
  audit: Add service level support to CQL login process
  audit: Add support to CQL statements
  audit: Integrate audit subsystem into Scylla main process
  audit: Add documentation for the audit subsystem
  audit: Add the audit subsystem
2025-01-17 13:14:55 +01:00
Kamil Braun
89ee2a6834 Merge 'drop ip addresses from token metadata' from Gleb
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.

Fixes: scylladb/scylla#21777

* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
  hint manager: do not translate ip to id in case hint manager is stopped already
  locator: token_metadata: drop update_host_id() function that does nothing now
  locator: topology: drop indexing by ips
  repair: drop unneeded code
  storage_service: use host_id to look for a node in on_alive handler
  storage_proxy: translate ips to ids in forward array using gossiper
  locator: topology: remove unused functions
  storage_service: check for outdated ip in on_change notification in the peers table
  storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
  topology coordinator: change connection dropping code to work on host ids
  cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
  locator: drop unused function from tablet_effective_replication_map
  api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
  locator: token_metadata: remove unused ip based functions
  locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
  gossiper: drop get_unreachable_token_owners functions
  storage_service: use gossiper to map ip to id in node_ops operations
  storage_service: fix indentation after the last patch
  storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
  token_metadata: drop no longer used functions
  ...
2025-01-17 11:00:52 +01:00
Gleb Natapov
50fb22c8f9 locator: topology: drop indexing by ips
Do not track id to ip mapping in the topology class any longer. There
are no remaining users.
2025-01-16 16:37:08 +02:00
Gleb Natapov
1b6e1456e5 messaging_service: drop the usage of ip based token_metadata APIs
We want to drop ips from token_metadata so move to use host id based
counterparts. Messaging service gets a function that maps from ips to id
when is starts listening.
2025-01-16 16:37:06 +02:00
Calle Wilund
00b40eada3 cql_test_env: Use add all extensions instead of inidividually 2025-01-15 12:08:09 +00:00
Paweł Zakrzewski
28bd699c51 audit: Add service level support to CQL login process
This change integrates service level functionality into the CQL authentication and connection handling:

- Add scheduling_group_name to client_data to track service level assignments
- Extend SASL challenge interface to expose authenticated username
- Modify connection processing to support tenant switching:
  - Add switch_tenant() method to handle scheduling group changes
  - Add process_until_tenant_switch() to handle request processing boundaries
  - Implement no_tenant() default executor
  - Add execute_under_tenant_type for scheduling group management

- Update connection lifecycle to properly handle service level changes:
  - Initialize connections with default scheduling group
  - Support dynamic scheduling group updates when service levels change
  - Ensure proper cleanup of scheduling group assignments

The changes enable proper scheduling group assignment and management based on
authenticated users' service levels, while maintaining backward compatibility
for connections without service level assignments.
2025-01-15 11:10:36 +01:00