Commit Graph

4972 Commits

Author SHA1 Message Date
Radosław Cybulski
c36614e16d alternator: add size check to BatchItemWrite
Add a size check for BatchItemWrite command - if the item count is
bigger than configuration value `alternator_maximum_batch_write_size`,
an error will be raised and no modification will happen.

This is done to synchronize with DynamoDB, where maximum size of
BatchItemWrite is 25. To avoid complaints from clients, who use
our feature of BatchWriteItem being limitless we set default value
to 100.

Fixes #5057

Closes scylladb/scylladb#23232
2025-04-02 14:48:00 +03:00
Avi Kivity
882f405eed Merge "Convert gossiper's endpoint state map to be host id based" from Gleb
"
The series makes endpoint state map in the gossiper addressable by host
id instead of ips. The transition has implication outside of the
gossiper as well. Gossiper based topology operations are affected by
this change since they assume that the mapping is ip based.

On wire protocol is not affected by the change as maps that are sent by
the gossiper protocol remain ip based. If old node sends two different
entries for the same host id the one with newer generation is applied.
If new node has two ids that are mapped to the same ip the newer one is
added to the outgoing map.

Interoperability was verified manually by running mixed cluster.

The series concludes the conversion of the system to be host id based.
"

* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
  gossiper: make examine_gossiper private
  gossiper: rename get_nodes_with_host_id to get_node_ip
  treewide: drop id parameter from gossiper::for_each_endpoint_state
  treewide: move gossiper to index nodes by host id
  gossiper: drop ip from replicate function parameters
  gossiper: drop ip from apply_new_states parameters
  gossiper: drop address from handle_major_state_change parameter list
  gossiper: pass rpc::client_info to gossiper_shutdown verb handler
  gossiper: add try_get_host_id function
  gossiper: add ip to endpoint_state
  serialization: fix std::map de-serializer to not invoke value's default constructor
  gossiper: drop template from  wait_alive_helper function
  gossiper: move get_supported_features and its users to host id
  storage_service: make candidates_for_removal host id based
  gossiper: use peers table to detect address change
  storage_service: use std::views::keys instead of std::views::transform that returns a key
  gossiper: move _pending_mark_alive_endpoints to host id
  gossiper: do not allow to assassinate endpoint in raft topology mode
  gossiper: fix indentation after previous patch
  gossiper: do not allow to assassinate non existing endpoint
2025-04-02 12:30:00 +03:00
Kefu Chai
6da758d74c config: mark uuid_sstable_identifiers_enabled unused
the option of `uuid_sstable_identifier_enabled` was introduced in
f014ccf3 . the first version which has this change was 5.4, and
6.1 has been branched. during the discussion of backup and restore,
we realized that we've been taking efforts to address problems which
could have been addressed with the sstable with UUID-based identifier.
see also #10459 which is the issue which proposed to implement UUID-v1
based sstable identifier.

now that two major releases passed, we should have the luxury to mark
this option "unused". this option which was previously introduced to
keep the backward compatibility, and to allow user to opt-out of the
feature for some reasons.

so in this change,  mark the option unused, so that if any user still
sets this option with command line, they will get a clear error. but
we still parse and handle this setting in `scylla.yaml`, so that this
option is still respected for existing settings, and for existing tests,
which are not yet prepared for the uuid-based sstable identifiers.

Refs #10459
Fixes #20337

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20341
2025-04-01 20:21:47 +03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Michał Chojnowski
4f0d453acf dict_autotrainer: introduce sstable_dict_autotrainer
Add a fiber responsible for periodic re-training of compression dictionaries
(for tables which opted into dict-aware compression).

As of this patch, it works like this:
every `$tick_period` (15 minutes), if we are the current Raft leader,
we check for dict-aware tables which have no dict, or a dict older
than `$retrain_period`.

For those tables, if they have enough data (>1GiB) for a training,
we train a new dict and check if it's significantly better
than the current one (provides ratio smaller than 95% of current ratio),
and if so, we update the dict.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
9d02e2c005 db/system_keyspace: add query_dict_timestamp
Adds a helper method which queries the creation timestamp
of a given dict in `system.dicts`.

We will later use the age of the current SSTable compression dict
to decide if another training should be done already.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
bea866a46f main: clean up sstable compression dicts after table drops
When a table is dropped, its corresponding dictionary in `system.dicts`
-- if any -- should be deleted, otherwise it will remain forever as
garbage.

This commit implements such cleanup.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
4856f4acca db/system_keyspace: let system.dicts helpers be used for dicts other than the RPC compression dict
Extend the `system.dicts` helper for querying and modifying
`system.dicts` with an ability to use names other than "general".
We will use that in later commits to publish dictionaries for SSTable compression.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
b77c611c00 raft/group0_state_machine: on system.dicts mutations, pass the affected partitition keys to the callback
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".

In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.

To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)

For now, only the `system.dicts` callback uses the partition keys.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
30a9d471fa sstables: plug an sstable_compressor_factory into sstables_manager
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.

In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
2025-04-01 00:07:28 +02:00
Robert Bindar
b647196121 Remove db::config::object_storage_config
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 17:15:12 +03:00
Gleb Natapov
28fb84117d treewide: drop id parameter from gossiper::for_each_endpoint_state
We have it in endpoint_state anyway, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
4609bbbbb2 treewide: move gossiper to index nodes by host id
This patch changes gossiper to index nodes by host ids instead of ips.
The main data structure that changes is _endpoint_state_map, but this
results in a lot of changes since everything that uses the map directly
or indirectly has to be changed. The big victim of this outside of the
gossiper itself is topology over gossiper code. It works on IPs and
assumes the gossiper does the same and both need to be changed together.
Changes to other subsystems are much smaller since they already mostly
work on host ids anyway.
2025-03-31 16:50:50 +03:00
Gleb Natapov
0dd86b4f1d gossiper: move get_supported_features and its users to host id 2025-03-31 15:42:07 +03:00
Robert Bindar
e3a3508960 Move object_storage.yaml endpoints to scylla.yaml
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 13:39:39 +03:00
Pavel Emelyanov
9aa986a49a snapshot-ctl: Remove unused snapshot-single-table method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-28 10:45:31 +03:00
Benny Halevy
62aeba759b tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 15:32:16 +02:00
Benny Halevy
c62865df90 db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 14:54:45 +02:00
Avi Kivity
7646e1448a Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.

The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
  keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
  so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
  keyspaces are RF-rack-valid. If not, the initialization fails.

We provide tests verifying that the changes behave as intended.

---

Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.

---

Implementation strategy (going beyond the scope of this PR):

1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.

---

Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300

---

Backport: this is part of the requirements for releasing 2025.1.

Closes scylladb/scylladb#23138

* github.com:scylladb/scylladb:
  main: Refuse to start node when RF-rack-invalid keyspace exists
  cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
  db/config: Introduce RF-rack-valid keyspaces
2025-03-20 19:10:36 +02:00
Avi Kivity
a62ab824e6 schema: deprecate schema_extension
schema_extension allows making invisible changes to system_schema
that evade upgrade rollback tests. They appear in system_schema
as an encoded blob which reduces serviceability, as they cannot
be read.

Deprecate it and point users to adding explicit columns in scylla_tables.

We could probably make use of the data structure, after we teach it
to encode its payload into proper named and typed columns instead of
using IDL.

Closes scylladb/scylladb#23151
2025-03-19 20:36:16 +02:00
Dawid Mędrek
32879ec0d5 db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356
2025-03-19 14:46:35 +01:00
Piotr Dulikowski
2ca1c0b6f9 Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak
This PR introduces the new Raft-based recovery procedure for group 0
majority loss.

The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.

The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.

We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.

The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.

For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.

The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.

Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from  `system.raft`,
`system.raft_snaphots`, and `system.raft_snapshot_config`.

In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.

This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.

Fixes scylladb/scylladb#20657

Closes scylladb/scylladb#22286

* github.com:scylladb/scylladb:
  test: mark tests with the gossip-based recovery procedure
  test: add tests for the Raft-based recovery procedure
  test: topology: util: fix the tokens consistency check for left nodes
  test: topology: util: extend start_writes
  gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
  raft_group0: modify_raft_voter_status: do not add new members
  treewide: allow recreating group 0 in the Raft-based recovery procedure
2025-03-18 19:10:56 +01:00
Botond Dénes
2795d83b32 Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund
Fixes #23017

When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed.

Now, eventually, this should be released, once we do deletion again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure.

As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is
to simply give the segments to all CL shards, thus distributing the available space.

Closes scylladb/scylladb#23150

* github.com:scylladb/scylladb:
  main/commitlog: wait for file deletion and distribute recycled segments to shards
  commitlog: Serialize file deletion
2025-03-18 11:47:17 +02:00
Botond Dénes
fda3486770 Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov
Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet same complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps.

There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_table() helpers from api/ and also saves few lookups in database maps.

Removing the parse_tables() from api/ is the continuation of previous effort that reduces the set of helpers in api/ code that help handlers "parse" keyspaces and tables names see #22742 #21533

Closes scylladb/scylladb#23216

* github.com:scylladb/scylladb:
  api: Remove the remaining parse_tables() overload
  database: Sanitize flush_tables_on_all_shards()
  schema_tables: Remove all_table_names()
  database: Make tables flushing helper use table_info-s, not names
  api: Make keyspace flush endpoint use parse_table_infos() (and a bit more)
  schema_tables,client_state: Switch to using all_table_infos()
  schema_tables: Tune up some methods to benefit from table_infos
  schema_tables: Introduce all_table_infos()
2025-03-17 15:40:41 +02:00
Calle Wilund
4ed81e05bf commitlog: Serialize file deletion
Fixes #23017

When deleting segments while our footprint is over the limit,
mainly when recycling/deleting segments after replay (recover
boot) we can cause two deletion passes to be running at the same
time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are
considering many files for delete/recycle, we can, due to task
switch, end up considering segments ok to keep, in parallel,
even though one of them should be deleted. The end result
will be us keeping one more segment than should be allowed.
Now, eventually, this should be released, once we do deletion
again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might
cause some delay in processing cycles for recycle, but in
practice, this should never happen when we are in fact under
pressure.

Small unit test included.
2025-03-17 12:09:00 +00:00
Patryk Jędrzejczak
9970c1fcc3 gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
This patch ensures that members of the new group 0 can gossip with
members of the old group 0 during rolling restart in the Raft-based
recovery procedure. Without this change, restarted nodes (members of
the new group 0) wouldn't be marked as UP by other nodes (members of
the old group 0), which would decrease availability.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
fd51d7e448 treewide: allow recreating group 0 in the Raft-based recovery procedure
This patch adds support for recreating group 0 after losing
majority. This is the only part of the new Raft-based recovery
procedure that touches Scylla core.

The following steps are necessary to recreate group 0:
1. Determine the new group 0 members. These are alive nodes that
are normal or rebuilding.
2. Choose the recovery leader - the node which will become the
new group 0 leader. This must be one of the nodes with the
latest persistent group 0 state.
3. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
4. Set the new scylla.yaml parameter - `recovery_leader` - to Host
ID of the recovery leader on each live node.
5. Rolling restart all live nodes, but the recovery leader must be
restarted first.

In the implementation, restarts in step 5 are very similar to normal
restarts with the Raft-based topology enabled. The only differences
are:
1. Steps 3-4 make the restarting node discover the new group 0
in `join_cluster`.
2. The group 0 server is started in `join_group0`, not
`setup_group0_if_exists`.
3. The restarting node joins the new group 0 in `join_topology` using
`legacy_handshaker`. There is no reason to contact the topology
coordinator since the node has already joined the topology.

Unfortunately, this patch creates another execution path for the
starting logic. `join_cluster` becomes even messier. However, there
is nothing we can do about it. Joining group 0 without joining
topology is something completely new. Having a few small changes
without touching other execution paths is the best we can do.
We will start removing the old stuff soon, after making the
Raft-based topology mandatory, and the situation will improve.
2025-03-14 13:52:57 +01:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This is series starts conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. On of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new node can co-exist in the same
cluster,
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Dawid Mędrek
0a6137218a db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811
2025-03-13 11:55:15 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
0e3dcb7954 treewide: move everyone to use host id based gossiper::is_alive and drop ip based one 2025-03-11 12:09:21 +02:00
Gleb Natapov
e47f251178 gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id
Index live and dead endpoints by host id. It also allows to simplify
some code that does a translation.
2025-03-11 12:09:21 +02:00
Pavel Emelyanov
89f3c1a91e database: Sanitize flush_tables_on_all_shards()
Previous patch left this method with few uglinesses

- the vector<table_id> argument is named table_names
- the sstring keyspace argument is unused
- the keyspace argument is captured for no use

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:13:10 +03:00
Pavel Emelyanov
0f9cc956f4 schema_tables: Remove all_table_names()
Now it's unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:12:56 +03:00
Pavel Emelyanov
c2d23d7948 database: Make tables flushing helper use table_info-s, not names
The database::flush_tables_on_all_shards() method accepts a keyspace
name and a vector of table names. Then it converts ks:cf pair for each
of the table name into a table-id and flushes the table with the ID.

All the callers of that method already have or can easily get the vector
of table_id-s, not just names, so make use of this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:11:32 +03:00
Pavel Emelyanov
5a897d7368 schema_tables,client_state: Switch to using all_table_infos()
There are few more places left that can use all_table_infos() as a
replacement for all_table_names(), patch them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:05:59 +03:00
Pavel Emelyanov
da05765746 schema_tables: Tune up some methods to benefit from table_infos
There are convert_schema_to_mutations() and calculate_schema_digest()
that collect table names and then use them to find schema and query
mutations from the table.

Both can use the newly introduced all_table_infos() and use the returned
table_id-s to do the same, thus avoiding re-lookups (which are fast
anyway, but still).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:01:50 +03:00
Pavel Emelyanov
d7bfa5a545 schema_tables: Introduce all_table_infos()
This method is like all_table_names(), but returns a vector of
table_info-s which is effectively a pair of string name and uuid id.
To be used later, and the string-returning all_table_name() will be
removed very soon too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 12:59:03 +03:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Pavel Emelyanov
86b3e9b50b code: Move checked-file-impl.hh to util/
fixes: #22100

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23123
2025-03-06 10:22:05 +02:00
Pavel Emelyanov
e7d1ea3ab6 commitlog: Use shorter input stream creation overload
There's one that doesn't need the offset argument when it's 0

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23140
2025-03-06 08:06:42 +01:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
102aec64d5 system_keyspace: shutdown: allow calling more than once
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:22 +02:00
Benny Halevy
fba88bdd62 database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
To allow safe plug and unplug of the system_keyspace.

This patch follows-up on 917fdb9e53
(more specifically - f9b57df471)
Since just keeping a shared_ptr<system_keyspace> doesn't prevent
stopping the system_keyspace shards, while using the `pluggable`
interface allows safe draining of outstanding async calls
on shutdown, before stopping the system_keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:27:23 +02:00
Botond Dénes
71d8b7aa9f querier: demote tombstone warning for range-scans to debug level
Range scans are expected to go though lots of tombstones, no need to
spam the logs about this. The tombstone warning log is demoted to debug
level, if somebody wants to see it they can bump the logger to debug
level.

Fixes: https://github.com/scylladb/scylladb/issues/23093

Closes scylladb/scylladb#23094
2025-03-04 10:38:06 +03:00
Botond Dénes
6f7a069bce Merge 'Label basic metrics' from Amnon Heiman
This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and the metrics collecting server (Prometheus or similar)

The idea in this series is to create an equivalent of levels with a label.
First, label a subset of the metrics used by the dashboards.
Second, the per-table metrics that are now off by default will be marked with a different label.
The following specific optional features: CDC, CAS, and Alternator have a dedicated label now.
This will allow users to disable all metrics of features that are not in use.

All the rest of the metrics are left unlabeled.

Without any changes, users would get the same metrics they are getting today.
But you could pass the `__level=1` and get only those metrics the dashboard needs. That reduces between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies).

The labels are not reported based on the seastar feature of hiding labels that start with an underscore.

Closes scylladb/scylladb#12246

* github.com:scylladb/scylladb:
  db/view/view.cc: label metrics with basic_level
  transport/server.cc: label metrics with basic_level
  service/storage_proxy.cc: label metrics with basic_level and cas
  main.cc: label metrics with basic_level
  streaming/stream_manager.cc: label metrics with basic_level
  repair/repair.cc: label metrics with basic_level
  service/storage_service.cc: label metrics with basic_level
  gms/gossiper.cc: label metrics with basic_level
  replica/database.cc: label metrics with basic_level
  cdc/log.cc: label metrics with basic_level and cdc
  alternator: label metrics with basic_level and alternator
  row_cache.cc: label metrics with basic_level
  query_processor.cc: label metrics with basic_level
  sstables.cc: label metrics with basic_level
  utils/logalloc.cc label metrics with basic_level
  commitlog.cc: label metrics with basic_level
  compaction_manager.cc: label metrics with basic_level
  Adding the __level and features labels
2025-03-04 09:32:11 +02:00
Calle Wilund
2f10205714 config: Enable optional TLS1.3 session ticket usage in cert setup
Refs #22916

Adds an "enable_session_tickets" option to TLS setup for our server
endpoints (not documented for internode RPC, as we don't handle it
on the client side there), allowing enabling of TLS3 client session
ticket, i.e. quicker reconnect.

Session tickets are valid within a time frame or until a node
restarts, whichever comes first.

v2:
Use "TLS1.3" in help message

Closes scylladb/scylladb#22928
2025-03-04 09:30:53 +02:00
Amnon Heiman
19a414598b db/view/view.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_view_builder_builds_in_progress

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
f40dc4e5c4 row_cache.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_cache_bytes_total
scylla_cache_bytes_used
scylla_cache_partition_evictions
scylla_cache_partition_hits
scylla_cache_partition_insertions
scylla_cache_partition_merges
scylla_cache_partition_misses
scylla_cache_partition_removals
scylla_cache_range_tombstone_reads
scylla_cache_reads
scylla_cache_reads_with_misses
scylla_cache_row_evictions
scylla_cache_row_hits
scylla_cache_row_insertions
scylla_cache_row_misses
scylla_cache_row_removals
scylla_cache_rows
scylla_cache_rows_merged_from_memtable
scylla_cache_row_tombstone_reads

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00