Commit Graph

1518 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
3299ffba51 Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.

The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.

Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.

Refs #25115

Fixes #24625

Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.

Closes scylladb/scylladb#25151

* https://github.com/scylladb/scylladb:
  raft_group0: split shutdown into abort_and_drain and destroy
  Revert "main.cc: fix group0 shutdown order"
2025-07-29 10:39:00 +02:00
Botond Dénes
f3ed27bd9e Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov
Nowadays the way to configure an internal service is

1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value

The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.

For the reference: similar changes with other services: #23705 , #20174 , #19166

Closes scylladb/scylladb#25118

* github.com:scylladb/scylladb:
  gms,init: Move get_disabled_features_from_db_config() from gms
  code: Update callers generating feature service config
  gms: Make feature_config a simple struct
  gms: Split feature_config_from_db_config() into two
2025-07-29 08:17:49 +03:00
Nadav Har'El
b4fc3578fc Merge 'LWT: enable for tablet-based tables' from Petr Gusev
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.

Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
  * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
  * This breaks LWT correctness in certain scenarios.
  * Failing test cases demonstrating this:
      * test_lwt_state_is_preserved_on_tablet_migration
      * test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
  * Tablets may be placed on arbitrary shards by the tablet balancer.
  * Accessing Paxos state in system tables could require a shard jump, degrading performance.

We move Paxos state into a dedicated Paxos table, colocated with the base table:
  * Each base table gets its own Paxos state table.
  * This table is lazily created on the first LWT operation.
  * Its tablets are colocated with those of the base table, ensuring:
    * Co-migration during tablet movement
    * Co-rebuilding with the base table
    * Shard alignment for local access to Paxos state

Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].

This PR addresses two issues from the "Why doesn't it work for tablets" section  in [1]:
  * Tablet migration vs LWT correctness
  * Paxos table sharding

Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.

References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)

Backport: not needed because this is a new feature.

Closes scylladb/scylladb#24819

* github.com:scylladb/scylladb:
  create_keyspace: fix warning for tablets
  docs: fix lwt.rst
  docs: fix tablets.rst
  alternator: enable LWT
  random_failures: enable execute_lwt_transaction
  test_tablets_lwt: add test_paxos_state_table_permissions
  test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
  test_tablets_lwt: test timeout creating paxos state table
  test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
  test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
  test_tablets_lwt: migrate test_lwt_support_with_tablets
  test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
  test_tablets_lwt: add simple test for LWT
  check_internal_table_permissions: handle Paxos state tables
  client_state: extract check_internal_table_permissions
  paxos_store: handle base table removal
  database: get_base_table_for_tablet_colocation: handle paxos state table
  paxos_state: use node_local_only mode to access paxos state
  query_options: add node_local_only mode
  storage_proxy: handle node_local_only in query
  storage_proxy: handle node_local_only in mutate
  storage_proxy: introduce node_local_only flag
  abstract_replication_strategy: remove unused using
  storage_proxy: add coordinator_mutate_options
  storage_proxy: rename create_write_response_handler -> make_write_response_handler
  storage_proxy: simplify mutate_prepare
  paxos_state: lazily create paxos state table
  migration_manager: add timeout to start_group0_operation and announce
  paxos_store: use non-internal queries
  qp: make make_internal_options public
  paxos_store: conditional cf_id filter
  paxos_store: coroutinize
  feature_service: add LWT_WITH_TABLETS feature
  paxos_state: inline system_keyspace functions into paxos_store
  paxos_state: extract state access functions into paxos_store
2025-07-28 13:19:23 +03:00
Avi Kivity
1930f3e67f Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.

However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.

For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)

Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.

Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome.

Preparation for new functionality, no backporting needed.

Closes scylladb/scylladb#25093

* github.com:scylladb/scylladb:
  sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
  test/boost: add a test for inexact index lookups
  sstables/mx/reader: allow passing a custom index reader to the constructor
  sstables/index_reader: remove advance_to
  sstables/mx/reader: handle inexact lookups in `advance_context()`
  sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
  sstables/index_reader: make the return value of `get_partition_key` optional
  sstables/mx/reader: handle "backward jumps" in forward_to
  sstables/mx/reader: filter out partitions outside the queried range
  sstables/mx/reader: update _pr after `fast_forward_to`
2025-07-27 19:39:36 +03:00
Petr Gusev
8b8b7adbe5 raft_group0: split shutdown into abort_and_drain and destroy
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).

However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.

This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.

The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.

Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().

Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().

The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
2025-07-25 17:16:14 +02:00
Michał Chojnowski
141895f9eb sstables/index_reader: make the return value of get_partition_key optional
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
2025-07-25 11:00:18 +02:00
Ernest Zaslavsky
d2c5765a6b treewide: Move keys related files to a new keys directory
As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system.

Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh

Fixes: #22102
Fixes: #22103
Fixes: #22105

Closes scylladb/scylladb#25082
2025-07-25 10:45:32 +03:00
Petr Gusev
ac4bc3f816 paxos_state: lazily create paxos state table
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00
Pavel Emelyanov
8220974e76 code: Update callers generating feature service config
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-21 19:19:09 +03:00
Botond Dénes
20693edb27 Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).

In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface.
Later, we will add BTI indexes which will also implement this interface.

This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.

Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.

No backports needed, this is a preparation for new functionality.

Closes scylladb/scylladb#25000

* github.com:scylladb/scylladb:
  sstables: add sstable::make_index_reader() and use where appropriate
  sstables/mx: in readers, use abstract_index_reader instead of index_reader
  sstables: in validate(), use abstract_index_reader instead of index_reader where possible
  test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
  sstables/index_reader: introduce abstract_index_reader
  sstables/index_reader: extract a prefetch_lower_bound() method
2025-07-17 14:32:08 +03:00
Michał Chojnowski
92219a5ef8 test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
We don't want tests to create the concrete `index_reader` directly. We
would like them to be able to test both sstables which use
`index_reader`, and those which will use the planned new index implementation.
So we will let the tests construct an abstract_index_reader and pass it
to the index_reader_assertions, which will be able to assert the requested
properties on various implementations as it wants.
2025-07-17 10:32:56 +02:00
Calle Wilund
5dd871861b tests::proc::process_fixture: Fix line handler adaptor buffering
Fixes #24998

Helper routine translating input_stream buffers to single lines
did not loop over current buffer state, leading to only the first
line being sent to end listener.

Rewrote to use range iteration instead. Nicer.

Closes scylladb/scylladb#24999
2025-07-17 10:58:03 +03:00
Avi Kivity
c762425ea7 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start
2025-07-16 13:15:54 +03:00
Andrzej Jackowski
77a9b5919b main: utils: add thread names to alien workers
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
2025-07-15 23:29:21 +02:00
Andrzej Jackowski
9574513ec1 auth: move passwords::check call to alien thread
Analysis of customer stalls showed that the `detail::hash_with_salt`
function, called from `passwords::check`, often blocks the reactor.
This function internally uses the `crypt_r` function from an external
library to compute password hashes, which is a CPU-intensive operation.

To prevent such reactor stalls, this commit moves the
`passwords::check` call to a dedicated alien thread. This thread is
created at system startup and is shared by all shards.

Within the alien thread, an `std::mutex` synchronizes access between
the thread and the shards. While this could theoretically cause
frequent lock contentions, in practice, even during connection storms,
the number of new connections per second per shard is limited
(typically hundreds per second). Additionally, the
`_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not
too many connections are processed at once.

Fixes scylladb/scylladb#24524
2025-07-15 23:29:13 +02:00
Botond Dénes
26f135a55a Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixure not generate TLS noise + remove boost::process' from Calle Wilund
Fixes #24873

In KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors.

This just makes sure `release`  closes the connection if it neither retains or caches it.

Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that look like actual errors.
Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs.

This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency.

Closes scylladb/scylladb#24874

* github.com:scylladb/scylladb:
  encryption_test: Make PyKMIP run under seastar::experimental::process
  test/lib: Add wrapper helper for test process fixtures
  kmip_host: Close connections properly if dropped by pool being full
  encryption_at_rest_test: Do port check using TLS
2025-07-15 06:55:34 +03:00
Calle Wilund
253323bb64 test/lib: Add wrapper helper for test process fixtures
Adds a wrapper for seastar::experimental::process, to help
use external process fixtures in unit test. Mainly to share
concepts such as line reading of stdout/err etc, and sync
the shutdown of these. Also adds a small path searcher to
find what you want to run.
2025-07-14 12:18:16 +00:00
Avi Kivity
6fce817aa8 Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531

Closes scylladb/scylladb#24886

[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]

* github.com:scylladb/scylladb:
  test: add type creation to test_snapshot
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-07-13 20:47:55 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Marcin Maliszkiewicz
19bc6ffcb0 replica: make truncate_table_on_all_shards get whole schema from table_shards
Before for views and indexes it was fetching base schema from db (and
couple other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.

Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
fa157e7e46 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-07-10 10:40:43 +02:00
Pawel Pery
7bf53fc908 vector_store_client: implement initial vector_store_client service
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.

For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.

This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.

Reference: VS-47 VS-45
2025-07-08 16:29:55 +02:00
Avi Kivity
1e0b015c8b Merge 'cql3: Represent create_statement using managed_bytes' from Dawid Mędrek
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.

Fixes scylladb/scylladb#24018

Backport: all of the supported versions are affected, so we want to backport the changes there.

Closes scylladb/scylladb#24151

* github.com:scylladb/scylladb:
  cql3/description: Serialize only rvalues of description
  cql3: Represent create_statement using managed_string
  cql3/statements/describe_statement.cc: Don't copy descriptions
  cql3: Use managed_bytes instead of bytes in DESCRIBE
  utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
2025-07-01 21:59:38 +03:00
Dawid Mędrek
ac9062644f cql3: Represent create_statement using managed_string
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.

Fixes scylladb/scylladb#24018
2025-07-01 12:58:02 +02:00
Avi Kivity
b33dd2bd7d Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

Closes scylladb/scylladb#24492

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-06-29 18:18:36 +03:00
Piotr Dulikowski
62efe6616a Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is <host, shard> pair) processes two tablets at a time.

The new algorithm can be summarized as follows:
 1. Prepare a tablet_replica -> partition_range mapping where the values     cover the entire space.
 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each
artition range (in such cases, the partition range is assigned to the appropriate tablet_replica).

In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.

In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.

In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.

Additionally, shard_id is added to mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensure that the balance is preserved, even during events such as tablet splits.

This patch series:
 - Refactors a bit mapreduce service, to facilitate having two algorithm versions (one for vnodes and one for tablets).
 - Implements tablet-aware dispatching algorithm.
 - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
 - Adds test_long_query_timeout_erm to verify the new functionality.

Fixes: scylladb#21831

No backport, as it is rather new feature than a bugfix.

Closes scylladb/scylladb#24383

* github.com:scylladb/scylladb:
  mapreduce: add missing comma and space in mapreduce_request operator<<
  mapreduce: add shard_id_hint to mapreduce request
  test: add test_long_query_timeout_erm
  mapreduce: add tablet-aware dispatching algorithm
  storage_proxy: make storage_proxy::is_alive public
  mapreduce: remove _shared_token_metadata from mapreduce_service
  mapreduce: move dispatching logic to dispatch_to_vnodes
  mapreduce: remove underscores from variable names
  mapreduce: move req_with_modified_pr handling to a new function
  mapreduce: change next_vnode lambda to get_next_partition_range function
2025-06-26 12:25:39 +02:00
Andrzej Jackowski
9dbb1468b4 mapreduce: remove _shared_token_metadata from mapreduce_service
Before this change, `mapreduce_service` used `_shared_token_metadata`
to get the topology. However, the token was used in a part of the code
that already had its own ERM with its own metadata token. Moreover,
as mapreduce_service's token and ERM's token are not guaranteed to be
the same, inconsistencies could occur.

Therefore, this commit removes `_shared_token_metadata` and its usage.
2025-06-25 08:42:16 +02:00
Botond Dénes
aae212a87c test/lib/cql_assertions: introduce columns_assertions
To enable targeted and optionally typed assertions against individual
columns in a row.
2025-06-25 08:41:29 +03:00
Botond Dénes
ebd9420687 sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
2025-06-25 08:41:26 +03:00
Benny Halevy
15bee9f232 sstables: sstable_generation_generator: set last_generation=0 by default
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
f0f7c83705 test: lib: test_env: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Benny Halevy
0310a03de6 test: sstable_test: always use uuid sstable generation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-06-18 11:30:29 +03:00
Avi Kivity
cd79a8fc25 Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).

Fixes #24513
2025-06-16 22:38:12 +03:00
Tomasz Grabiec
0b516da95b Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single  `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`.

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649

Closes scylladb/scylladb#20853

* github.com:scylladb/scylladb:
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-06-10 13:45:32 +02:00
Raphael S. Carvalho
2d716f3ffe replica: Fix truncate assert failure
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.

1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.

So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.

Fixes #23771.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24426
2025-06-08 15:59:15 +03:00
Marcin Maliszkiewicz
a27776b4ff replica: make truncate_table_on_all_shards get whole schema from table_shards
Before for views and indexes it was fetching base schema from db (and
couple other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.

Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
92e3d69f79 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-06-06 08:50:33 +02:00
Dawid Mędrek
c60035cbf6 test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
We've adjusted all of the Boost tests so they respect the invariant
enforced by the `rf_rack_valid_keyspaces` configuration option, or
explicitly disabled the option in those that turned out to be more
problematic and will require more attention. Thanks to that, we can
now enable it by default in the test suite.
2025-05-27 18:53:39 +02:00
Aleksandra Martyniuk
9c03255fd2 cql_test_env: main: move stream_manager initialization
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at a time.

Move stream_manager initialization over the storage_service initialization.

Fixes: #23207.

Closes scylladb/scylladb#24008
2025-05-15 17:17:35 +03:00
Avi Kivity
5e764d1de2 Merge 'Drop v2 and flat from reader and related names' from Botond Dénes
Following a number of similar code cleanup PR, this one aims to be the last one, definitely dropping flat from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.

Code cleanup, no backport required.

Closes scylladb/scylladb#24087

* github.com:scylladb/scylladb:
  test/boost/mutation_reader_another_test: drop v2 from reader and related names
  test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
  test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
  test/boost/mutation_test: s/consumer_v2/consumer/
  test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
  readers/mutation_readers: s/generating_reader_v2/generating_reader/
  readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
  readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
  readers/mutation_source: s/make_reader_v2/make_mutation_reader/
  readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
  readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
  mutation/mutation_compactor: drop v2 from compactor and related names
  replica/table: s/make_reader_v2/make_mutation_reader/
  mutation_writer: s/bucket_writer_v2/bucket_writer/
  readers/queue: drop v2 from reader and related names
  readers/multishard: drop v2 from reader and related names
  readers/evictable: drop v2 from reader and related names
  readers/multi_range: remove flat from name
2025-05-11 22:22:35 +03:00
Botond Dénes
17b667b116 test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ 2025-05-09 07:53:30 -04:00
Botond Dénes
674d41e3e6 readers/mutation_source: s/make_reader_v2/make_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
ca7f557e86 readers/multishard: drop v2 from reader and related names 2025-05-09 07:53:29 -04:00
Michał Chojnowski
1bcf77951c compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an assymetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
2025-05-07 14:43:18 +02:00
Michał Chojnowski
8649adafa8 test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.
2025-05-07 14:43:04 +02:00
Michał Chojnowski
0e4d0ded8d test: remove sstables::test_env::do_with()
`sstable_manager` depends on `sstable_compressor_factory&`.
Currently, `test_env` obtains an implementation of this
interface with the synchronous `make_sstable_compressor_factory()`.

But after this patch, the only implementation of that interface
`sstable_compressor_factory&` will use `sharded<...>`,
so its construction will become asynchronous,
and the synchronous `make_sstable_compressor_factory()` must disappear.

There are several possible ways to deal with this, but I think the
easiest one is to write an asynchronous replacement for
`make_sstable_compressor_factory()`
that will keep the same signature but will be only usable
in a `seastar::thread`.

All other uses of `make_sstable_compressor_factory()` outside of
`test_env::do_with()` already are in seastar threads,
so if we just get rid of `test_env::do_with()`, then we will
be able to use that thread-dependent replacement. This is the
purpose of this commit.

We shouldn't be losing much.
2025-05-07 13:19:21 +02:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
59dad2121f compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Avi Kivity
2dcd2b21ae Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631

Backport to 2025.1 because load imbalance is a serious problem in production.

Closes scylladb/scylladb#23708

* github.com:scylladb/scylladb:
  tablets: Equalize per-table balance when allocating tablets for a new table
  load_sketch: Tolerate missing tablet_map when selecting for a given table
  tests: tablets: Simplify tests by moving common code to topology_builder
2025-04-21 17:06:30 +03:00