Commit Graph

355 Commits

Author SHA1 Message Date
Petr Gusev
8292ecf2e1 paxos_state: inline system_keyspace functions into paxos_store
Prepares for reusing the same functions to access either
system.paxos or a co-located table.
2025-07-24 16:39:50 +02:00
Botond Dénes
0753643606 db/system_keyspace: add apply_mutation()
Allow applying writes in the form of mutations directly to the keyspace.
Allows lower-level mutation API to build writes. Advantageous if writes
can contain large cells that would otherwise possibly cause large
allocation warnings if used via the internal CQL API.
2025-06-24 11:05:30 +03:00
Botond Dénes
92b5fe8983 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.
2025-06-24 11:05:30 +03:00
Gleb Natapov
00fd427be0 topology request: make it possible to hold global request types in request_type field
topology_request table has a filed to hold a request type, but
currently it can hold only per node requests. This patch makes it
possible to store global request types there as well.
2025-06-09 13:38:49 +03:00
Gleb Natapov
3a496067c6 topology coordinator: move alter table global request parameters into topology_request table
Currently parameters to alter table global topology command are stored
in static column in the topology table, but this way there can be only one
outstanding alter table request. This patch moves the parameters to
the topology_request table where parameters are stored per request.
2025-06-09 13:38:49 +03:00
Łukasz Paszkowski
dc6f8881b8 system_keyspace: Extract compaction_history struct
Move the compaction_history_entry struct to a seperate file. The intent
of this change is to later re-use it in scylla-nodetool as it currently
defines its own structure that is very similar.
2025-05-14 08:31:40 +02:00
Łukasz Paszkowski
4c93b5292d system_keyspace: Squeeze update_compaction_history parameters
Since the number of statistics inserted into compaction_history
table grows in time, the number of parameters in the method
update_compaction_history grows as well.

So instead, let's re-use the already existing compaction_history_entry
structure to populate data from the compaction_manager to the
system table.
2025-05-14 08:31:40 +02:00
Piotr Dulikowski
ce2fab7cce main: make dc and rack immutable after bootstrap
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
2025-04-17 16:22:26 +02:00
Avi Kivity
882f405eed Merge "Convert gossiper's endpoint state map to be host id based" from Gleb
"
The series makes endpoint state map in the gossiper addressable by host
id instead of ips. The transition has implication outside of the
gossiper as well. Gossiper based topology operations are affected by
this change since they assume that the mapping is ip based.

On wire protocol is not affected by the change as maps that are sent by
the gossiper protocol remain ip based. If old node sends two different
entries for the same host id the one with newer generation is applied.
If new node has two ids that are mapped to the same ip the newer one is
added to the outgoing map.

Interoperability was verified manually by running mixed cluster.

The series concludes the conversion of the system to be host id based.
"

* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
  gossiper: make examine_gossiper private
  gossiper: rename get_nodes_with_host_id to get_node_ip
  treewide: drop id parameter from gossiper::for_each_endpoint_state
  treewide: move gossiper to index nodes by host id
  gossiper: drop ip from replicate function parameters
  gossiper: drop ip from apply_new_states parameters
  gossiper: drop address from handle_major_state_change parameter list
  gossiper: pass rpc::client_info to gossiper_shutdown verb handler
  gossiper: add try_get_host_id function
  gossiper: add ip to endpoint_state
  serialization: fix std::map de-serializer to not invoke value's default constructor
  gossiper: drop template from  wait_alive_helper function
  gossiper: move get_supported_features and its users to host id
  storage_service: make candidates_for_removal host id based
  gossiper: use peers table to detect address change
  storage_service: use std::views::keys instead of std::views::transform that returns a key
  gossiper: move _pending_mark_alive_endpoints to host id
  gossiper: do not allow to assassinate endpoint in raft topology mode
  gossiper: fix indentation after previous patch
  gossiper: do not allow to assassinate non existing endpoint
2025-04-02 12:30:00 +03:00
Michał Chojnowski
9d02e2c005 db/system_keyspace: add query_dict_timestamp
Adds a helper method which queries the creation timestamp
of a given dict in `system.dicts`.

We will later use the age of the current SSTable compression dict
to decide if another training should be done already.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
bea866a46f main: clean up sstable compression dicts after table drops
When a table is dropped, its corresponding dictionary in `system.dicts`
-- if any -- should be deleted, otherwise it will remain forever as
garbage.

This commit implements such cleanup.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
4856f4acca db/system_keyspace: let system.dicts helpers be used for dicts other than the RPC compression dict
Extend the `system.dicts` helper for querying and modifying
`system.dicts` with an ability to use names other than "general".
We will use that in later commits to publish dictionaries for SSTable compression.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
b77c611c00 raft/group0_state_machine: on system.dicts mutations, pass the affected partitition keys to the callback
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".

In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.

To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)

For now, only the `system.dicts` callback uses the partition keys.
2025-04-01 00:07:29 +02:00
Gleb Natapov
0dd86b4f1d gossiper: move get_supported_features and its users to host id 2025-03-31 15:42:07 +03:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
102aec64d5 system_keyspace: shutdown: allow calling more than once
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:22 +02:00
Abhinav Jha
e491950c47 raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
In the current scenario, topology_change_kind variable, was been handled using
 _manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)

Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.

Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.

By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.

Fixes: scylladb/scylladb#21114
2025-02-14 16:56:17 +05:30
Botond Dénes
b3f8c4faa7 Merge 'node_ops: filter topology_requests entries shown by node_ops_virtual_task' from Aleksandra Martyniuk
node_ops_virtual_task does not filter the entries of system.topology_request
and so it creates statuses of operations that aren't node ops.

Filter the entries used by node_ops_virtual_task. With this change, the status
of a bootstrap of the first node will not be visible.

Fixes: https://github.com/scylladb/scylladb/issues/22008.

Needs backport to 6.2 that introduced node_ops_virtual_task

Closes scylladb/scylladb#22009

* github.com:scylladb/scylladb:
  test: truncate the table before node ops task checks
  node_ops: rename a method that get node ops entries
  node_ops: filter topology_requests entries
2025-01-07 14:17:01 +02:00
Michał Chojnowski
cc15ca329e db/system_keyspace: add system.dicts
Adds a new system table which will act as the medium for distributing
compression dictionaries over the cluster.

This table will be managed by Raft (group 0). It will be hooked up to it in
follow-up commits.
2024-12-23 23:37:02 +01:00
Aleksandra Martyniuk
ee4bd287fd node_ops: rename a method that get node ops entries 2024-12-20 12:25:48 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Ferenc Szili
3ac44109e3 system.topology_requests: change schema
This commit adds the new column in the system.topology_requests
table which are needed for the new global topology request.
2024-12-04 11:30:06 +01:00
Gleb Natapov
fbaf0a3cce group0: move group0 upgrade code to host ids
Drop unneeded ip to id translation.
2024-12-02 10:31:13 +02:00
Tomasz Grabiec
8738d9bfa0 system_tables: Compute schema version automatically
This depends on the previous change to the schema_builder
which makes version computation depend on definition only
instead of being new time uuid.

This way we avoid the possibility for a common mistake
when schema of a system table is extended but we forget
to bump up its version passed to .with_version().
2024-11-15 19:16:41 +01:00
Pavel Emelyanov
3315e3a2a9 system_keyspace: Change sstables registry partition key type
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.

This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:48:09 +03:00
Avi Kivity
df3ee94467 db: system_keyspace: extract auth_version_t into its own header
Users of auth_version_t shouldn't need to include the heavyweight
system_keyspace.hh.
2024-09-28 16:31:50 +03:00
Michael Litvak
c1f3517a75 view_builder: improve migration to v2 with intermediate phase
Add an intermediate phase to the view builder migration to v2 where we
write to both the old and new table in order to not lose writes during
the migration.
We add an additional view builder version v1_5 between v1 and v2 where
we write to both tables. We perform a barrier before moving to v2 to
ensure all the operations to the old table are completed.
2024-09-05 15:42:35 +03:00
Michael Litvak
f3887cd80b db:system_keyspace: add view_builder_version to scylla_local
Add a new scylla_local parameter view_builder_version, and functions to
read and mutate the value.
The version value defaults to v1 if it doesn't exist in the table.
2024-09-05 15:41:04 +03:00
Michael Litvak
bf4a58bf91 db:system_keyspace: add view_build_status_v2 table
add the table system.view_build_status_v2 with the same schema as
system_distributed.view_build_status.
2024-09-05 15:41:04 +03:00
Laszlo Ersek
9e95f3a198 db/system_keyspace: promise a tagged integer from increment_and_get_generation()
Internally, increment_and_get_generation() produces a
"gms::generation_type" value.

In turn, all callers of increment_and_get_generation() -- namely
scylla_main() [main.cc] and single_node_cql_env::run_in_thread()
[test/lib/cql_test_env.cc] -- pass the resolved value to
storage_service::init_address_map() and storage_service::join_cluster(),
both of which take a "gms::generation_type".

Therefore it is pointless to "untag" the generation value temporarily
between the producer and the consumers. Correct the return type of
increment_and_get_generation().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Aleksandra Martyniuk
c64cb98bcf db: node_ops: filter topology request entries
system_keyspace::get_topology_request_entries returns entries for
requests which are running or have finished after specified time.

In task manager node ops task set the time so that they are shown
for task_ttl seconds after they have finished.
2024-07-23 13:35:02 +02:00
Aleksandra Martyniuk
94282b5214 db: service: modify methods to get topology_requests data
Modify get_topology_request_state (and wait_for_topology_request_completion),
so that it doesn't call on_internal_error when request_id isn't
in the topology_requests table if require_entry == false.

Add other methods to get topology request entry.
2024-07-23 13:35:01 +02:00
Marcin Maliszkiewicz
2ab143fb40 db: auth: move auth tables to system keyspace
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769
2024-05-26 22:30:42 +03:00
Benny Halevy
1061455442 gossiper: add load_endpoint_state
Pack the topology-related data loaded from system.peers
in `gms::load_endpoint_state`, to be used in a following
patch for `add_saved_endpoint`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:06:56 +03:00
Marcin Maliszkiewicz
562caaf6c6 auth: keep auth version in scylla_local
Before the patch selection of auth version depended
on consistent topology feature but during raft recovery
procedure this feature is disabled so we need to persist
the version somewhere to not switch back to v1 as this
is not supported.

During recovery auth works in read-only mode, writes
will fail.
2024-04-02 19:04:21 +02:00
Michał Jadwiszczak
dab909b1d1 service:qos: store information about sl data migration
Save information whether service levels data was migrated to v2 table.
The information is stored in `system.scylla_local` table. It's
written with raft command and included in raft snapshot.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
8e242f5acd db:system_keyspace: add SERVICE_LEVELS_V2 table
The table has the same schema as `system_distributed.service_levels`.
However it's created entirely at once (unlike old table which creates
base table first and then it adds other columns) because `system` tables
are local to the node.
2024-03-21 23:14:57 +01:00
Avi Kivity
e48eb76f61 sstables_manager: decouple from system_keyspace
sstables_manager now depends on system_keyspace for access to the
system.sstables table, needed by object storage. This violates
modularity, since sstables_manager is a relatively low-level leaf
module while system_keyspace integrates large parts of the system
(including, indirectly, sstables_manager).

One area where this is grating is sstables::test_env, which has
to include the much higher level cql_test_env to accommodate it.

Fix this by having sstables_manager expose its dependency on
system_keyspace as an interface, sstables_registry, and have
system_keyspace implement the glue logic in
system_keyspace_sstables_manager.

Closes scylladb/scylladb#17868
2024-03-18 20:38:07 +03:00
Tomasz Grabiec
61b3453552 raft topology: Keep nodes in the left state to topology
Those nodes will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:

 1) algorithms which work with replica sets filter nodes based on
 their location. For example materialized views code which pairs base
 replicas with view replicas filters by datacenter first.

 2) tablet scheduler needs to identify each node's location in order
 to make decisions about new replica placement.

It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.

Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).

In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.

We load topology infromation only for left nodes which are actually
referenced by any tablet. To achieve that, topology loading code
queries system.tablet for the set of hosts. This set is then passed to
system.topology loading method which decides whether to load
replica_state for a left node or not.
2024-03-15 11:05:29 +01:00
Marcin Maliszkiewicz
ebb0ffeb6c auth: implement auth-v2 migration
During raft topology upgrade procedure data from
system_auth keyspace will be migrated to system_auth_v2.

Migration works mostly on top of CQL layer to minimize
amount of new code introduced, it mostly executes SELECTs
on old tables and then INSERTs on new tables. Writes are
not executed as usual but rather announced via raft.
2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz
5a6d4dbc37 storage_service: add support for auth-v2 raft snapshots
This patch adds new RPC for pulling snapshot of auth tables.
2024-03-01 16:25:14 +01:00
Patryk Jędrzejczak
b8aa74f539 raft topology: clean only obsolete CDC generations' data
Currently, we may clear a CDC generation's data from
CDC_GENERATIONS_V3 if it is not the last committed generation
and it is at least 24 hours old (according to the topology
coordinator's clock). However, after allowing writes to the
previous CDC generations, this condition became incorrect. We
might clear data of a generation that could still be written to.

The new solution is to clear data of the generations that
finished operating more than 24 hours ago. The rationale behind
it is in the new comment in
`topology_coordinator:clean_obsolete_cdc_generations`.

The previous solution used the clean-up candidate. After
introducing `committed_cdc_generations`, it became unneeded.
The last obsolete generation can be computed in
`topology_coordinator:clean_obsolete_cdc_generations`. Therefore,
we remove all the code that handles the clean-up candidate.

After changing how we clear CDC generations' data,
`test_current_cdc_generation_is_not_removed` became obsolete.
The tested feature is not present in the code anymore.

`test_dependency_on_timestamps` became the only test case covering
the CDC generation's data clearing. We adjust it after the changes.
2024-02-20 12:35:18 +01:00
Petr Gusev
86410d71d1 system_keyspace::update_peer_info: make host_id an explicit parameter
The host_id field should always be set, so it's more
appropriate to pass it as a separate parameter.

The function storage_service::get_peer_info_for_update
is  updated. It shouldn't look for host_id app
state is the passed map, instead the callers should
get the host_id on their own.
2024-02-15 13:21:04 +04:00
Petr Gusev
4a14988735 system_keyspace: remove duplicate ips for host_id
When a node changes IP we call sync_raft_topology_nodes
from raft_ip_address_updater::on_endpoint_change with
the old IP value in prev_ip parameter.
It's possible that the nodes crashes right after
we insert a new IP for the host_id, but before we
remove the old IP. In this commit we fix the
possible inconsistency by removing the system.peers
record with old timestamp. This is what the new
peers_table_read_fixup function is responsible for.

We call this function in all system_keyspace methods
that read the system.peers table. The function
loads the table in memory, decides if some rows
are stale by comparing their timestamps and
removes them.

The new function also removes the records with no
host_id, so we no longer need the get_host_id function.

We'll add a test for the problem this commit fixes
in the next commit.
2024-02-15 13:21:04 +04:00
Piotr Dulikowski
fb02453686 system_keyspace: add read_cdc_generation_opt
The `system_keyspace::read_cdc_generation` loads a cdc generation from
the system tables. One of its preconditions is that the generation
exists - this precondition is quite easy to satisfy in raft mode, and
the function was designed to be used solely in that mode.

In legacy mode however, in case when we revert from raft mode through
recovery, it might be necessary to use generations created in raft mode
for some time. In order to make the function useful as a fallback in
case lookup of a generation in legacy mode fails, introduce a relaxed
variant of `read_cdc_generation` which returns std::nullopt if the
generation does not exist.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
3513a07d8a feature_service: fall back to checking legacy features on startup
When checking features on startup (i.e. whether support for any feature
was revoked in an unsafe way), it might happen that upgrade to raft
topology didn't finish yet. In that case, instead of loading an empty
set of features - which supposedly represents the set of features that
were enabled until last boot - we should fall back to loading the set
from the legacy `enabled_features` key in `system.scylla_local`.
2024-02-08 19:12:28 +01:00
Michał Chojnowski
7c5a8894be db: system_keyspace: add system.commitlog_cleanups
Add a system table which will hold records of cleanup operations,
for the purpose of filtering commitlog replays to avoid data
resurrections.
2024-01-24 10:37:38 +01:00
Gleb Natapov
84197ff735 storage_service: topology coordinator: check topology operation completion using status in topology_requests table
Instead of trying to guess if a request completed by looking into the
topology state (which is sometimes can be error prone) look at the
request status in the new topology_requests. If request failed report
a reason for the failure from the table.
2024-01-16 17:02:54 +02:00
Gleb Natapov
ecb8778950 system keyspace: introduce local table to store topology requests status
The table has the following schema and will be managed by raft:

CREATE TABLE system.topology_requests (
    id timeuuid PRIMARY KEY,

    initiating_host uuid,
    start_time timestamp,

    done boolean,
    error text,
    end_time timestamp,
);

In case of an request completing with an error the "error" filed will be non empty when "done" is set to true.
2024-01-16 13:57:16 +02:00
Kefu Chai
be364d30fd db: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16664
2024-01-09 11:44:19 +02:00