Commit Graph

1478 Commits

Author SHA1 Message Date
Avi Kivity
5e764d1de2 Merge 'Drop v2 and flat from reader and related names' from Botond Dénes
Following a number of similar code cleanup PRs, this one aims to be the last one, definitely dropping flat from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.

Code cleanup, no backport required.

Closes scylladb/scylladb#24087

* github.com:scylladb/scylladb:
  test/boost/mutation_reader_another_test: drop v2 from reader and related names
  test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
  test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
  test/boost/mutation_test: s/consumer_v2/consumer/
  test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
  readers/mutation_readers: s/generating_reader_v2/generating_reader/
  readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
  readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
  readers/mutation_source: s/make_reader_v2/make_mutation_reader/
  readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
  readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
  mutation/mutation_compactor: drop v2 from compactor and related names
  replica/table: s/make_reader_v2/make_mutation_reader/
  mutation_writer: s/bucket_writer_v2/bucket_writer/
  readers/queue: drop v2 from reader and related names
  readers/multishard: drop v2 from reader and related names
  readers/evictable: drop v2 from reader and related names
  readers/multi_range: remove flat from name
2025-05-11 22:22:35 +03:00
Botond Dénes
17b667b116 test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ 2025-05-09 07:53:30 -04:00
Botond Dénes
674d41e3e6 readers/mutation_source: s/make_reader_v2/make_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
ca7f557e86 readers/multishard: drop v2 from reader and related names 2025-05-09 07:53:29 -04:00
Michał Chojnowski
1bcf77951c compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would put unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an asymmetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).
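
As a rough illustration of the ownership rule described above (not the actual ScyllaDB code), the sketch below picks one owner shard per NUMA node from a hash of the dictionary contents. The shard-to-NUMA layout, the helper names and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib

def owner_shard(dict_blob: bytes, shards_in_numa_node: list) -> int:
    """Pick the shard that owns this dictionary within one NUMA node.

    Hypothetical helper: hashes the dictionary contents and maps the hash
    onto the node's shards, so ownership is spread over the shards instead
    of piling up on shard 0.
    """
    digest = hashlib.sha256(dict_blob).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards_in_numa_node)
    return shards_in_numa_node[index]

# Assumed layout: 8 shards split evenly over 2 NUMA nodes.
numa_layout = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def owners_per_numa_node(dict_blob: bytes) -> dict:
    """One copy of the dictionary (and one owner shard) per NUMA node."""
    return {node: owner_shard(dict_blob, shards)
            for node, shards in numa_layout.items()}

owners = owners_per_numa_node(b"example dictionary contents")
```

Every shard would then borrow the copy owned within its own NUMA node, so random accesses to the blob stay node-local.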

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
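
The arithmetic behind the 100-shard versus 1-shard comparison can be checked with a small sketch. The byte sizes are made-up numbers chosen so that the dictionaries take 1% of each shard on the large node; only the ratio matters.

```python
def dict_memory_fraction(dict_bytes_total: int, shard_memory_bytes: int,
                         shard_count: int) -> float:
    """Fraction of each shard's memory taken by dictionaries when their
    ownership is spread evenly over all shards of a node."""
    per_shard_dict_bytes = dict_bytes_total / shard_count
    return per_shard_dict_bytes / shard_memory_bytes

shard_memory = 1_000_000_000   # assumed: 1 GB of memory per shard
dicts_total = 1_000_000_000    # assumed: 1 GB of dictionaries in total

# Same dictionaries, same per-shard memory, different shard counts.
big_node = dict_memory_fraction(dicts_total, shard_memory, shard_count=100)
small_node = dict_memory_fraction(dicts_total, shard_memory, shard_count=1)
```

With these inputs the large node spends about 1% of each shard on dictionaries, while the single-shard node would need all of its memory for the same set.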
2025-05-07 14:43:18 +02:00
Michał Chojnowski
8649adafa8 test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.
2025-05-07 14:43:04 +02:00
Michał Chojnowski
0e4d0ded8d test: remove sstables::test_env::do_with()
`sstable_manager` depends on `sstable_compressor_factory&`.
Currently, `test_env` obtains an implementation of this
interface with the synchronous `make_sstable_compressor_factory()`.

But after this patch, the only implementation of that interface
`sstable_compressor_factory&` will use `sharded<...>`,
so its construction will become asynchronous,
and the synchronous `make_sstable_compressor_factory()` must disappear.

There are several possible ways to deal with this, but I think the
easiest one is to write an asynchronous replacement for
`make_sstable_compressor_factory()`
that will keep the same signature but will be only usable
in a `seastar::thread`.

All other uses of `make_sstable_compressor_factory()` outside of
`test_env::do_with()` already are in seastar threads,
so if we just get rid of `test_env::do_with()`, then we will
be able to use that thread-dependent replacement. This is the
purpose of this commit.

We shouldn't be losing much.
2025-05-07 13:19:21 +02:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
59dad2121f compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Avi Kivity
2dcd2b21ae Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.
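
A toy model of the scenario above, assuming a greedy allocator that places tablets one at a time on the least-loaded node. The node names, tablet counts and the `allocate` helper are hypothetical; the real allocator is more involved.

```python
def allocate(tablet_count, nodes, load_of):
    """Greedily place tablets one by one on the node with the lowest load,
    where load_of(node, placed) defines what "load" means."""
    placed = {n: 0 for n in nodes}
    for _ in range(tablet_count):
        # Ties are broken by node name to keep the result deterministic.
        target = min(nodes, key=lambda n: (load_of(n, placed), n))
        placed[target] += 1
    return placed

# Assumed cluster: two old nodes with 100 tablets each, two new empty nodes.
existing = {"old1": 100, "old2": 100, "new1": 0, "new2": 0}
nodes = sorted(existing)

# Equalizing global load: the whole new table lands on the new nodes.
by_global = allocate(8, nodes, lambda n, placed: existing[n] + placed[n])

# Equalizing per-table load: the new table is spread over all nodes.
by_table = allocate(8, nodes, lambda n, placed: placed[n])
```

In the first case later rebalancing moves other tables' tablets onto the new nodes but never fixes the new table's own skew; in the second the table starts out balanced.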

Fixes #23631

Backport to 2025.1 because load imbalance is a serious problem in production.

Closes scylladb/scylladb#23708

* github.com:scylladb/scylladb:
  tablets: Equalize per-table balance when allocating tablets for a new table
  load_sketch: Tolerate missing tablet_map when selecting for a given table
  tests: tablets: Simplify tests by moving common code to topology_builder
2025-04-21 17:06:30 +03:00
Pavel Emelyanov
eb5b52f598 Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
Fixes: scylladb/scylladb#22869

Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga

Closes scylladb/scylladb#23800

* github.com:scylladb/scylladb:
  doc: changing topology when changing snitches is no longer supported
  test: cluster: introduce test_no_dc_rack_change
  storage_service: don't update DC/rack in update_topology_with_local_metadata
  main: make dc and rack immutable after bootstrap
  test: cluster: remove test_snitch_change
2025-04-21 15:52:55 +03:00
Piotr Dulikowski
ce2fab7cce main: make dc and rack immutable after bootstrap
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.
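
The startup rule can be sketched as follows. This is an illustration only: the function and exception names are hypothetical, and how the first-boot location is persisted is not shown.

```python
class DcRackChangedError(RuntimeError):
    """Raised when a bootstrapped node reports a different DC or rack."""

def validate_location(saved, reported):
    """saved: (dc, rack) recorded at first boot, or None if this is the first boot.
    reported: (dc, rack) currently reported by the snitch.
    Returns the location to keep on record; raises if the node has moved."""
    if saved is None:
        return reported          # first boot: remember what the snitch says
    if saved != reported:
        raise DcRackChangedError(
            f"snitch reports {reported}, but node was bootstrapped in {saved}")
    return saved

first_boot = validate_location(None, ("dc1", "rack1"))
restart_ok = validate_location(first_boot, ("dc1", "rack1"))
try:
    validate_location(first_boot, ("dc1", "rack2"))
    refused = False
except DcRackChangedError:
    refused = True
```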

Fixes: scylladb/scylladb#23278
2025-04-17 16:22:26 +02:00
Tomasz Grabiec
d493a8d736 tests: tablets: Simplify tests by moving common code to topology_builder
Reduces code duplication.
2025-04-15 16:05:41 +02:00
Pavel Emelyanov
b25cb5af0c Merge 'Use named gates' from Benny Halevy
Name the gates and phased barriers we use
to make it easy to debug gate_closed_exception
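
To show why naming helps, here is a plain-Python analogy of a gate. It is not the Seastar API; the class and exception names are invented for this sketch.

```python
class GateClosedError(Exception):
    """Raised when entering a gate that has already been closed."""

class NamedGate:
    """Toy gate: tracks in-flight operations and refuses new ones once closed."""

    def __init__(self, name: str):
        self.name = name
        self._closed = False
        self._in_flight = 0

    def enter(self) -> None:
        if self._closed:
            # The name shows which component rejected the operation.
            raise GateClosedError(f"gate closed: {self.name}")
        self._in_flight += 1

    def leave(self) -> None:
        self._in_flight -= 1

    def close(self) -> None:
        self._closed = True

gate = NamedGate("sstables_loader")
gate.enter()
gate.leave()
gate.close()
try:
    gate.enter()
    message = None
except GateClosedError as exc:
    message = str(exc)
```

Without the name, the same exception raised from dozens of components would look identical in a log.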

Refs https://github.com/scylladb/seastar/pull/2688

* Enhancement only, no backport needed

Closes scylladb/scylladb#23329

* github.com:scylladb/scylladb:
  utils: loading_cache: use named_gate
  utils: flush_queue: use named_gate
  sstables_manager: use named gate
  sstables_loader: use named gate
  utils: phased_barrier, pluggable: use named gate
  utils: s3::client::multipart_upload: use named gate
  utils: s3::client: use named_gate
  transport: controller: use named gate
  tracing: trace_keyspace_helper: use named gate
  task_manager: module: use named gate
  topology_coordinator: use named gate
  storage_service: use named gate
  storage_proxy: wait_for_hint_sync_point: use named gate
  storage_proxy: remote: use named gate
  service: session: use named gate
  service: raft: raft_rpc: use named gate
  service: raft: raft_group0: use named gate
  service: raft: persistent_discovery: use named gate
  service: raft: group0_state_machine: use named gate
  service: migration_manager: use named gate
  replica: table: use named gate
  replica: compaction_group, storage_group: use named gate
  redis: query_processor: use named gate
  repair: repair_meta: use named gate
  reader_concurrency_semaphore: use named gate
  raft: server_impl: use named gate
  querier_cache: use named gate
  gms: gossiper: use named gate
  generic_server: use named gate
  db: sstables_format_listener: use named gate
  db: snapshot: backup_task: use named gate
  db: snapshot_ctl: use named gate
  hints: hints_sender: use named gate
  hints: manager: use named gate
  hints: hint_endpoint_manager: use named gate
  commitlog: segment_manager: use named gate
  db: batchlog_manager: use named gate
  query_processor: remote: use named gate
  compaction: compaction_state: use named gate
  alternator/server: use named_gate
2025-04-14 20:56:32 +03:00
Pavel Emelyanov
1bd991a111 test: Inherit sstable_assertions from sstables::test
The latter class was invented to let tests access private members of an
sstable (mostly methods). The former is in fact an extended version of
it that also does some checks. However, they don't inherit from each
other, and sstable_assertions partially duplicates some functionality
of the test one.

Add the inheritance, remove the duplicated methods from the child class,
update the callers (the test class returns future<>s, the assertions one
"knows" it runs in a seastar thread) and mark sstable::read_toc() private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23697
2025-04-14 13:45:14 +03:00
Benny Halevy
e1fe82ed33 utils: phased_barrier, pluggable: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
2025-04-09 20:21:51 +02:00
Marcin Maliszkiewicz
b94acfb37b test: remove alternator code from perf-simple-query
This kind of benchmark was superseded by perf-alternator
which has more options, workflows and most importantly
measures overhead of http server layer (including json parsing).

There is no need to maintain additional code in perf-simple-query.

Closes scylladb/scylladb#23474
2025-04-06 18:15:16 +03:00
Botond Dénes
a0d8102a1f replica/memtable: s/make_flat_reader/make_mutation_reader/
Following the recent refactoring that removed "flat" and "v2" from reader
names, replace all the fully qualified names with simply "mutation_reader".

Closes scylladb/scylladb#23346
2025-04-01 17:58:13 +03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Michał Chojnowski
b77c611c00 raft/group0_state_machine: on system.dicts mutations, pass the affected partition keys to the callback
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".

In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.

To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)

For now, only the `system.dicts` callback uses the partition keys.
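
The quadratic behaviour mentioned above can be seen in a small simulation. The `DictsListener` class and its methods are hypothetical stand-ins for a callback consumer; a plain Python dict plays the role of `system.dicts`.

```python
class DictsListener:
    """Hypothetical consumer of system.dicts updates; counts the rows it reads."""

    def __init__(self, table: dict):
        self.table = table      # stand-in for system.dicts: partition key -> blob
        self.cache = {}
        self.rows_read = 0

    def on_update_full_scan(self) -> None:
        # Callback without keys: every partition has to be re-read.
        for key, blob in self.table.items():
            self.cache[key] = blob
            self.rows_read += 1

    def on_update_keys(self, affected_keys) -> None:
        # Callback with keys: only the modified partitions are re-read.
        for key in affected_keys:
            self.cache[key] = self.table[key]
            self.rows_read += 1

def simulate(n_dicts: int, keyed: bool) -> int:
    """Add n_dicts dictionaries one at a time and return the total rows read."""
    table = {}
    listener = DictsListener(table)
    for i in range(n_dicts):
        key = f"table_{i}"
        table[key] = b"dict-%d" % i
        if keyed:
            listener.on_update_keys([key])
        else:
            listener.on_update_full_scan()
    return listener.rows_read

full_scan_reads = simulate(100, keyed=False)   # 1 + 2 + ... + 100 reads
keyed_reads = simulate(100, keyed=True)        # one read per update
```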
2025-04-01 00:07:29 +02:00
Michał Chojnowski
30a9d471fa sstables: plug an sstable_compressor_factory into sstables_manager
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.

In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
2025-04-01 00:07:28 +02:00
Robert Bindar
b647196121 Remove db::config::object_storage_config
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 17:15:12 +03:00
Avi Kivity
7646e1448a Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.

The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
  keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
  so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
  keyspaces are RF-rack-valid. If not, the initialization fails.

We provide tests verifying that the changes behave as intended.

---

Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.

---

Implementation strategy (going beyond the scope of this PR):

1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.

---

Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300

---

Backport: this is part of the requirements for releasing 2025.1.

Closes scylladb/scylladb#23138

* github.com:scylladb/scylladb:
  main: Refuse to start node when RF-rack-invalid keyspace exists
  cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
  db/config: Introduce RF-rack-valid keyspaces
2025-03-20 19:10:36 +02:00
Botond Dénes
d06bc27979 Merge 'Don't export string filenames from sstable' from Pavel Emelyanov
There are several sstring-returning methods on class sstable that return paths to files. Mostly these are printed into logs, and sometimes put into exception messages. There are also places that use these strings as file names. Since sstables can now also be stored on S3, generic code shouldn't treat those strings as on-disk file names.

Other than that, even when the methods are used to put component names into logs, in many cases these log messages have debug or trace level, so the generated strings are immediately dropped on the floor, yet generating them is not cheap. The code would benefit from lazily-printed names.

This change introduces the component_name struct that wraps an sstable reference and a component ID (a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths becomes explicit.
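
The lazy-printing idea can be illustrated with Python's standard `logging` module, which formats arguments only when the message is actually emitted. The `ComponentName` class and its naming scheme below are made up for this sketch and do not mirror the C++ struct.

```python
import logging

class ComponentName:
    """Holds what is needed to build a component's name; builds it only when printed."""
    format_calls = 0  # counts how often the (comparatively costly) formatting ran

    def __init__(self, directory: str, generation: int, component: str):
        self.directory = directory
        self.generation = generation
        self.component = component

    def __str__(self) -> str:
        ComponentName.format_calls += 1
        # Made-up naming scheme, for illustration only.
        return f"{self.directory}/{self.generation}-{self.component}.db"

logger = logging.getLogger("sstable_sketch")
logger.setLevel(logging.INFO)           # debug messages are disabled

name = ComponentName("/data/ks/tbl", 42, "Index")
logger.debug("opening %s", name)        # disabled level: __str__ is never called
calls_after_debug = ComponentName.format_calls

path = str(name)                        # explicit conversion where a path is needed
```

Code that really needs a file path has to ask for the string explicitly, which makes such call sites easy to find.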

refs: #14122 (previous ugly attempt to achieve the same goal)

Closes scylladb/scylladb#23194

* github.com:scylladb/scylladb:
  sstable: Remove unused malformed_sstable_exception(string filename)
  sstables: Make filename() return component_name
  sstables: Make file_writer keep component_name on board
  sstables: Make get_filename() return component_name
  sstables: Make toc_filename() return component_name
  sstables: Make sstable::index_filename() return component_name
  sstables: Introduce struct component_name
  sstables: Remove unused sstable::component_filenames() method
  sstables: Do not print component filenames on load-and-stream wrap-up
  sstables: Explicitly format prefix in S3 object name making
  sstables: Don't include directory name in exception
  sstables: Use fmt::format instead of string concatenation
  sstables: Rename filename($component) calls to ${component}_filename()
  sstables: Rename local filename variable to component_name
2025-03-20 09:51:03 +02:00
Dawid Mędrek
0e04a6f3eb main: Refuse to start node when RF-rack-invalid keyspace exists
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.
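
A minimal sketch of such a startup check. Note that this commit message does not define RF-rack-validity; the rule used below (per-DC replication factor is 0, 1, or equal to the number of racks in that DC) is an assumption, and all function names are hypothetical.

```python
def is_rf_rack_valid(replication: dict, racks_per_dc: dict) -> bool:
    """replication: DC name -> replication factor.
    racks_per_dc: DC name -> number of racks.
    ASSUMED rule: the RF in each DC must be 0, 1, or the DC's rack count."""
    return all(rf in (0, 1, racks_per_dc.get(dc, 0))
               for dc, rf in replication.items())

def check_keyspaces_on_startup(keyspaces: dict, racks_per_dc: dict,
                               enforce: bool) -> None:
    """Refuse to start if enforcement is on and any keyspace is invalid."""
    if not enforce:
        return
    invalid = sorted(name for name, repl in keyspaces.items()
                     if not is_rf_rack_valid(repl, racks_per_dc))
    if invalid:
        raise RuntimeError(f"RF-rack-invalid keyspaces: {', '.join(invalid)}")

racks = {"dc1": 3}
keyspaces = {"good": {"dc1": 3}, "bad": {"dc1": 2}}

check_keyspaces_on_startup(keyspaces, racks, enforce=False)   # option off: no check
try:
    check_keyspaces_on_startup(keyspaces, racks, enforce=True)
    error = None
except RuntimeError as exc:
    error = str(exc)
```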

Fixes scylladb/scylladb#23300
2025-03-19 15:13:44 +01:00
Pavel Emelyanov
f06cc32812 sstables: Make filename() return component_name
Similarly to toc_, index_ and data filenames, make the generic component
name getter return not a string, but a wrapper object. Most of the
callers are log messages and exception generations. Other than that
there are tests, the filesystem storage driver and a few more places in
generic code that "know" that they work with real files, so make them use
explicit fmt::to_string().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
0cdeed858c sstables: Make toc_filename() return component_name
Most of the callers use the returned value as log message parameter,
some construct malformed_sstable_exception that was prepared by previous
patch.

The remaining callers explicitly use fmt::to_string(), these are

- pending deletion log creation
- filesystem storage code
- tests
- stream-blob code that re-loads sstable

All but the last one are OK to use string toc name, the last one is not
very correct in its usage of toc_filename string, but it needs more care
to be fixed properly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
dcc9167734 sstables: Rename filename($component) calls to ${component}_filename()
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the former.
It's shorter, nicer and makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Botond Dénes
969b07fdfd test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
The class actually implements the FlattenedConsumer, so fix the comment.
This eliminates the only reference to the StreamedMutationConsumer
concept.
2025-03-18 07:57:04 -04:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keyspace>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This series starts the conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. One of them is dropping code
related to the old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new nodes can co-exist in the same
cluster.
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to the above, so it is not implemented.

Also, the per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below the limit and nodes with more capacity will
be slightly above the limit. This shouldn't be a significant problem in practice; we could
compensate for this by increasing the limit.
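
The difference between equalizing tablet count and equalizing storage utilization can be sketched with two shards of different capacity. The numbers and helper names are invented for the example, and every tablet is assumed to have the same size, as the description states.

```python
def utilization(tablet_count: float, avg_tablet_size: float, capacity: float) -> float:
    """Share of a shard's storage capacity used by its tablets."""
    return tablet_count * avg_tablet_size / capacity

def capacity_aware_targets(total_tablets: int, capacities: dict) -> dict:
    """Split tablets in proportion to capacity, which equalizes utilization
    when all tablets have the same size."""
    total_capacity = sum(capacities.values())
    return {shard: total_tablets * cap / total_capacity
            for shard, cap in capacities.items()}

capacities = {"small": 100, "large": 300}   # assumed capacities, arbitrary units
avg_tablet_size = 5
total_tablets = 40

# Count-based balancing: 20 tablets per shard, very different utilization.
count_based = {s: utilization(20, avg_tablet_size, c) for s, c in capacities.items()}

# Capacity-aware balancing: tablet counts differ, utilization is equal.
targets = capacity_aware_targets(total_tablets, capacities)
capacity_based = {s: utilization(targets[s], avg_tablet_size, c)
                  for s, c in capacities.items()}
```

Here count-based balancing fills the small shard completely while the large one sits at one third, whereas the capacity-aware split leaves both at 50%.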

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
f0af3f261e messaging_service: add temporary address map entry on incoming connection
We want to move to use host ids as soon as possible. Currently it is
possible only after the full gossiper exchange (because only at this
point gossiper state is added and with it address map entry). To make it
possible to move to host ids earlier this patch adds address map entries
on incoming communication during CLIENT_ID verb processing. The patch
also adds generation to CLIENT_ID to use it when address map is updated.
It is done so that older gossiper entries can be overwritten with newer
mapping in case of IP change.
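
A sketch of a generation-guarded address map, to illustrate how a newer mapping can replace an older one after an IP change. The `AddressMap` class is hypothetical, and requiring a strictly newer generation is an assumption of this example, not something the commit states.

```python
class AddressMap:
    """Toy host id -> IP map where each entry remembers the generation it came from."""

    def __init__(self):
        self._entries = {}   # host_id -> (generation, ip)

    def update(self, host_id: str, ip: str, generation: int) -> bool:
        """Store the mapping unless a newer (or equal) generation is already known.
        Returns True if the entry was written."""
        current = self._entries.get(host_id)
        if current is not None and generation <= current[0]:
            return False     # stale information: keep the existing entry
        self._entries[host_id] = (generation, ip)
        return True

    def find(self, host_id: str):
        entry = self._entries.get(host_id)
        return entry[1] if entry else None

amap = AddressMap()
amap.update("host-a", "10.0.0.1", generation=1)             # learned from an old entry
changed = amap.update("host-a", "10.0.0.9", generation=2)   # node restarted with a new IP
stale = amap.update("host-a", "10.0.0.1", generation=1)     # late, outdated message
current_ip = amap.find("host-a")
```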
2025-03-11 12:09:21 +02:00
Tomasz Grabiec
69c49fb1a7 test: boost: tablets_test: Always provide capacity in load_stats
Move shared_load_stats to topology_builder.hh so that topology_builder
can maintain it. It will set capacity for all created nodes. Needed
now that the load balancer requires capacity to make decisions.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Avi Kivity
28906c9261 Merge 'scylla-sstable: introduce the query command' from Botond Dénes
The scylla-sstable dump-* command suite has proven invaluable in many investigations. In certain cases, however, `dump-data` is quite cumbersome, for example when trying to find certain values in an sstable, or trying to read the content of system tables when a node is down. In these cases one has to trudge through tons of uninteresting metadata and do compaction in one's head. This PR introduces the new scylla-sstable query command, specifically targeted at situations like this: it allows executing queries on sstables, exposing to the user all the power of CQL, to tailor the output as they see fit.

Select everything from a table:

    $ scylla sstable query --system-schema /path/to/data/system_schema/keyspaces-*/*-big-Data.db
     keyspace_name                 | durable_writes | replication
    -------------------------------+----------------+-------------------------------------------------------------------------------------
            system_replicated_keys |           true |                         ({class : org.apache.cassandra.locator.EverywhereStrategy})
                       system_auth |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 1})
                     system_schema |           true |                              ({class : org.apache.cassandra.locator.LocalStrategy})
                system_distributed |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 3})
                            system |           true |                              ({class : org.apache.cassandra.locator.LocalStrategy})
                                ks |           true | ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})
                     system_traces |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 2})
     system_distributed_everywhere |           true |                         ({class : org.apache.cassandra.locator.EverywhereStrategy})

Select everything from a single SSTable, use the JSON output (filtered through [jq](https://jqlang.github.io/jq/) for better readability):

    $ scylla sstable query --system-schema --output-format=json /path/to/data/system_schema/keyspaces-*/me-3gm7_127s_3ndxs28xt4llzxwqz6-big-Data.db | jq
    [
      {
        "keyspace_name": "system_schema",
        "durable_writes": true,
        "replication": {
          "class": "org.apache.cassandra.locator.LocalStrategy"
        }
      },
      {
        "keyspace_name": "system",
        "durable_writes": true,
        "replication": {
          "class": "org.apache.cassandra.locator.LocalStrategy"
        }
      }
    ]

Select a specific field in a specific partition using the command-line:

    $ scylla sstable query --system-schema --query "select replication from scylla_sstable.keyspaces where keyspace_name='ks'" ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
     replication
    -------------------------------------------------------------------------------------
     ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})

Select a specific field in a specific partition using ``--query-file``:

    $ echo "SELECT replication FROM scylla_sstable.keyspaces WHERE keyspace_name='ks';" > query.cql
    $ scylla sstable query --system-schema --query-file=./query.cql ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
     replication
    -------------------------------------------------------------------------------------
     ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})

New functionality: no backport needed.

Closes scylladb/scylladb#22007

* github.com:scylladb/scylladb:
  docs/operating-scylla: document scylla-sstable query
  test/cqlpy/test_tools.py: add tests for scylla-sstable query
  test/cqlpy/test_tools.py: make scylla_sstable() return table name also
  scylla-sstable: introduce the query command
  tools/utils: get_selected_operation(): use std::string for operation_options
  utils/rjson: streaming_writer: add RawValue()
  cql3/type_json: add to_json_type()
  test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()
2025-03-06 13:42:45 +02:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
A new RPC is introduced because load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Kefu Chai
5571b537b5 tree: Make values mutable to enable move semantics
Previously, variables were marked as const, causing std::move() calls to
be redundant as reported by GCC warnings. This change either removes
const qualifiers or marks related lambdas as mutable, allowing the
compiler to properly utilize move constructors for better performance.
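
To illustrate the effect described above, here is a minimal sketch. It is not taken from the patch: the `tracked` type, the counters, and the `push_from_*` helpers are hypothetical names made up for this example. It shows that `std::move()` on a const object silently falls back to the copy constructor, and that a by-value lambda capture behaves the same way unless the lambda is marked `mutable`.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical counters, for illustration only.
int g_copies = 0;
int g_moves = 0;

// Hypothetical type that records how it was constructed.
struct tracked {
    std::string payload;
    explicit tracked(std::string p) : payload(std::move(p)) {}
    tracked(const tracked& o) : payload(o.payload) { ++g_copies; }
    tracked(tracked&& o) noexcept : payload(std::move(o.payload)) { ++g_moves; }
};

// std::move() on a const object yields `const tracked&&`, which cannot bind
// to the move constructor's `tracked&&`, so the copy constructor is selected.
void push_from_const(std::vector<tracked>& out) {
    const tracked value("const source");
    out.push_back(std::move(value)); // copies despite std::move()
}

// Without const, the same call selects the move constructor.
void push_from_mutable(std::vector<tracked>& out) {
    tracked value("mutable source");
    out.push_back(std::move(value)); // moves
}

// A by-value capture is const inside the call operator unless the lambda
// is declared mutable; with `mutable`, the captured object can be moved out.
void push_from_lambda(std::vector<tracked>& out) {
    tracked value("captured");
    auto f = [v = std::move(value), &out]() mutable {
        out.push_back(std::move(v)); // moves because the lambda is mutable
    };
    f();
}
```

Calling the three helpers on a vector with reserved capacity should produce exactly one copy, coming from `push_from_const()`; the other two only move.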

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23066
2025-03-03 13:53:02 +03:00
Avi Kivity
3f05fa3a9b test: lib: replace boost::generate with std equivalent
Reduces dependencies on boost/range.

Closes scylladb/scylladb#23034
2025-02-27 01:05:46 +01:00
Kefu Chai
6e4cb20a69 tree: implement boost::accumulate with std::ranges library
Replace boost::accumulate() calls with std::ranges facilities. This
change reduces external dependencies and modernizes the codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23062
2025-02-26 23:22:02 +02:00
Botond Dénes
5d63ef4d15 Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund
Fixes #22314

Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in it.

Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.

Closes scylladb/scylladb#22327

* github.com:scylladb/scylladb:
  tools: Add standard extensions and propagate to schema load
  cql_test_env: Use add all extensions instead of individually
  main: Move extensions adding to function
  tomstone_gc: Make validate work for tools
2025-02-26 13:52:47 +02:00
Avi Kivity
6e70e69246 test/lib: mutation_assertions: deinline
While generally better to reduce inline code, here we get
rid of the clustering_interval_set.hh dependency, which in turn
depends on boost interval_set, a large dependency.

incremental_compaction_test.cc is adjusted for a missing header.

Closes scylladb/scylladb#22957
2025-02-25 11:40:54 +01:00
Avi Kivity
d99df7af6c Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec
This series achieves two things:

1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards

    This will result in new tables having at least 10 tablet replicas per
    shard by default.

    We want this to reduce tablet load imbalance due to differences in
    tablet count per shard, where some shards have 1 tablet and some
    shards have 2 tablets. With higher tablet count per shard, this
    difference-by-one is less relevant.

    Fixes https://github.com/scylladb/scylladb/issues/21967

2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count

    The per-shard goal is enforced by controlling average per-shard tablet replica
    count in a given DC, which is controlled by per-table tablet
    count. This is effective in respecting the limit on individual shards
    as long as tablet replicas are distributed evenly between shards.
    There is no attempt to move tablets around in order to enforce limits
    on individual shards in case of imbalance between shards.

    If the average per-shard tablet count exceeds the limit, all tables
    which contribute to it (have replicas in the DC) are scaled down
    by the same factor. Due to rounding up to the nearest power of 2,
    we may overshoot the per-shard goal by at most a factor of 2.

    The scaling is applied after computing desired tablet count due to
    all other factors: per-table tablet count hints, defaults, average tablet size.

    If different DCs want different scale factors of a given table, the
    lowest scale factor is chosen for a given table.

    When creating a new table, its tablet count is determined by tablet
    scheduler using the scheduler logic, as if the table was already created.
    So any scaling due to per-shard tablet count goal is reflected immediately
    when creating a table. It may however still take some time for the system
    to shrink existing tables. We don't reject requests to create new tables.

    Fixes #21458
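
The scaling rule in point 2 can be sketched roughly as follows. This is a simplified illustration for a single DC, not the scheduler's actual code: `round_up_pow2()`, `scale_to_goal()`, and their parameters are hypothetical, and each table's desired count is assumed to already be expressed as tablet replicas in the DC.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest power of two that is >= x (returns 1 for x <= 1).
std::uint64_t round_up_pow2(double x) {
    std::uint64_t p = 1;
    while (static_cast<double>(p) < x) {
        p <<= 1;
    }
    return p;
}

// Hypothetical sketch: if the average per-shard replica count exceeds the
// goal, scale every table down by the same factor, then round each result
// up to a power of two. `desired` holds per-table replica counts in one DC.
std::vector<std::uint64_t> scale_to_goal(const std::vector<std::uint64_t>& desired,
                                         std::uint64_t shards_in_dc,
                                         double per_shard_goal) {
    assert(shards_in_dc > 0);
    std::uint64_t total = 0;
    for (auto d : desired) {
        total += d;
    }
    const double avg = static_cast<double>(total) / static_cast<double>(shards_in_dc);
    // Scale down only when the average is above the goal.
    const double factor = avg > per_shard_goal ? per_shard_goal / avg : 1.0;
    std::vector<std::uint64_t> result;
    result.reserve(desired.size());
    for (auto d : desired) {
        result.push_back(round_up_pow2(static_cast<double>(d) * factor));
    }
    return result;
}
```

For example, with desired counts {64, 32, 32} on 4 shards and a goal of 10, the average is 32 per shard, so the common factor is 10/32. The scaled values 20 and 10 round up to 32 and 16, giving {32, 16, 16}, which is 16 replicas per shard on average: above the goal, but by less than a factor of 2.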

Closes scylladb/scylladb#22522

* github.com:scylladb/scylladb:
  config, tablets: Allow tablets_initial_scale_factor to be a fraction
  test: tablets_test: Test scaling when creating lots of tables
  test: tablets_test: Test tablet count changes on per-table option and config changes
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
  config: Make tablets_initial_scale_factor live-updateable
  tablets: load_balancer: Pick initial_scale_factor from config
  tablets, load_balancer: Fix and improve logging of resize decisions
  tablets, load_balancer: Log reason for target tablet count
  tablets: load_balancer: Move hints processing to tablet scheduler
  tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
  tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
  tablets: load_balancer: Determine desired count from size separately from count from options
  tablets: load_balancer: Determine resize decision from target tablet count
  tablets: load_balancer: Allow splits even if table stats not available
  tablets: load_balancer: Extract make_sizing_plan()
  tablets: Add formatter for resize_decision::way_type
  tablets: load_balancer: Simplify resize_urgency_cmp()
  tablets: load_balancer: Keep config items as instance members
  locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
  tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
  tablets: Set default initial tablet count scale to 10
  tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
  tablets: load_balancer: Extract get_schema_and_rs()
  tablets: load_balancer: Drop test_mode
2025-02-24 17:59:26 +02:00
Kefu Chai
5be39740a8 tree: migrate from boost::find to std::ranges algorithms
Replace boost::find() calls with std::ranges::find() and std::ranges::contains()
to leverage modern C++ standard library features. This change reduces external
dependencies and modernizes the codebase.

The following changes were made:
- Replaced boost::find() with std::ranges::find() where index/iterator is needed
- Used std::ranges::contains() for simple element presence checks

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22920
2025-02-20 09:28:57 +03:00
Tomasz Grabiec
f3b63bfeff test: cql_test_env: Expose db config 2025-02-19 16:29:08 +01:00
Avi Kivity
30a38e61d4 Merge 'sstables_manager: trigger reclaim/reload on components_memory_reclaim_threshold update' from Lakshmi Narayanan Sreethar
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.

This PR first refactors the reclaim and reload logic into a single
background fiber. It then updates the sstable manager to subscribe to
changes in the `components_memory_reclaim_threshold` configuration value
and immediately triggers the reclaim/reload fiber when a change is
detected.
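
The subscription pattern described above can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: `live_value` and `component_manager` are hypothetical stand-ins, not Scylla's config or sstables_manager classes, and the "fiber" is reduced to a counter so the example stays self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a live-updateable config value that notifies
// registered observers whenever it is set.
template <typename T>
class live_value {
    T _value;
    std::vector<std::function<void(const T&)>> _observers;
public:
    explicit live_value(T v) : _value(std::move(v)) {}
    const T& get() const { return _value; }
    void observe(std::function<void(const T&)> cb) {
        _observers.push_back(std::move(cb));
    }
    void set(T v) {
        _value = std::move(v);
        for (auto& cb : _observers) {
            cb(_value);
        }
    }
};

// Hypothetical manager: caches the threshold and reacts to updates right
// away instead of waiting for the next sstable creation or deletion.
class component_manager {
    double _threshold;
    std::size_t _runs = 0;
public:
    explicit component_manager(live_value<double>& cfg) : _threshold(cfg.get()) {
        // The callback captures `this`, so the manager must not be copied
        // or moved and must outlive any later cfg.set() call.
        cfg.observe([this](const double& v) {
            _threshold = v;
            trigger_reclaim_reload();
        });
    }
    component_manager(const component_manager&) = delete;
    component_manager& operator=(const component_manager&) = delete;

    // Real code would wake a background fiber here; this sketch only counts.
    void trigger_reclaim_reload() { ++_runs; }
    double threshold() const { return _threshold; }
    std::size_t runs() const { return _runs; }
};
```

Constructing a `component_manager` from a `live_value<double>` and then calling `set()` on the value updates the cached threshold and triggers one reclaim/reload pass immediately.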

Fixes #21947

This is an improvement and does not need to be backported.

Closes scylladb/scylladb#22725

* github.com:scylladb/scylladb:
  sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update
  sstables_manager: maybe_reclaim_components: yield between iterations
  sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()`
  sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()`
  sstables_manager: rename `_sstable_deleted_event` condition variable
  sstables_manager: rename `components_reloader_fiber()`
  sstables_manager: fix `maybe_reclaim_components()` indentation
  sstables_manager: reclaim components memory until usage falls below threshold
  sstables_manager: introduce `get_components_memory_reclaim_threshold()`
  sstables_manager: extract `maybe_reclaim_components()`
  sstables_manager: fix `maybe_reload_components()` indentation
  sstables_manager: extract out `maybe_reload_components()`
2025-02-17 22:33:33 +02:00
Botond Dénes
01a4d30d88 test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()
This variant of do_with_cql_env() forgoes the reentrancy support in the
regular do_with_cql_env() variants, and re-uses the caller's existing
seastar thread. This is an optimized version for callers which don't
need reentrancy and already have a thread.
2025-02-17 08:01:38 -05:00
Piotr Dulikowski
e4d574fdbb Merge 'Fix view-builder vs (repair and streaming) initialization order' from Pavel Emelyanov
Both repair and streaming depend on view builder, but since the builder is started too late, both keep a sharded<> reference to it and apply `if (view_builder.local_is_initialized())` safety checks.

However, view builder can do its sharded start much earlier, there's currently nothing that prevents it from doing so. This PR moves view builder start up together with some other of its dependencies, and relaxes the way repair and streaming use their view-builder references, in particular -- removes those ugly initialization checks.

refs: scylladb/scylladb#2737

Closes scylladb/scylladb#22676

* github.com:scylladb/scylladb:
  streaming: Relax streaming::make_streamig_consumer() view builder arg
  streaming: Keep non-sharded view_builder dependency reference
  streaming: Remove view_builder.local_is_initialized() checks
  repair: Keep non-sharded view_builder dependency reference
  repair: Remove view_builder.local_is_initialized() checks
  main: Start sharded<view_builder> earlier
  test/cql_env: Move stream manager start lower
2025-02-17 10:03:28 +01:00