Commit Graph

3655 Commits

Author SHA1 Message Date
Avi Kivity
4ddf82e58b treewide: don't #include "gms/feature_service.hh" from other headers
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds. Specifically, when retiring features.

Fix by removing feature_service.hh from headers, and supply forward
declarations and includes in .cc where needed.

Closes scylladb/scylladb#18005
2024-03-26 15:31:18 +02:00
Pavel Emelyanov
8bf9098663 system_keyspace: Consolidate node-state vs tokens checks
When loading topology state, nodes are checked to have or not to have
"tokens" field set. The check is done based on node state and it's
spread across the loading method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17957
2024-03-26 14:55:46 +02:00
Kefu Chai
f3532cbaa0 db: commitlog: use fmt::streamed() to print segment
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change:

* add `format_as()` for `segment` so we can use it as a fallback
  after upgrading to {fmt} v10
* use fmt::streamed() when formatting `segment`, this will be used
  the intermediate solution before {fmt} v10 after dropping
  `FMT_DEPRECATED_OSTREAM` macro

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18019
2024-03-26 12:13:29 +02:00
Wojciech Mitros
9789a3dc7c mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.

Fixes #17890

Closes scylladb/scylladb#17891
2024-03-25 19:43:58 +02:00
Piotr Dulikowski
f23f8f81bf Merge 'Raft-based service levels' from Michał Jadwiszczak
This patch introduces raft-based service levels.

The difference to the current method of working is:
- service levels are stored in `system.service_levels_v2`
- reads are executed with `LOCAL_ONE`
- writes are done via raft group0 operation

Service levels are migrated to v2 in topology upgrade.
After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to `system.scylla_local` table. If this row is present, raft data accessor is created from the beginning and it handles recovery mode procedure (service levels will be read from v2 table even if consistent topology is disabled then)

Fixes #17926

Closes scylladb/scylladb#16585

* github.com:scylladb/scylladb:
  test: test service levels v2 works in recovery mode
  test: add test for service levels migration
  test: add test for service levels snapshot
  test:topology: extract `trigger_snapshot` to utils
  main: create raft dda if sl data was migrated
  service:qos: store information about sl data migration
  service:qos: service levels migration
  main: assign standard service level DDA before starting group0
  service:qos: fix `is_v2()` method
  service:qos: add a method to upgrade data accessor
  test: add unit_test_raft_service_levels_accessor
  service:storage_service: add support for service levels raft snapshot
  service:qos: add abort_source for group0 operations
  service:qos: raft service level distributed data accessor
  service:qos: use group0_guard in data accessor
  cql3:statements: run service level statements on shard0 with raft guard
  test: fix overrides in unit_test_service_levels_accessor
  service:qos: fix indentation
  service:qos: coroutinize some of the methods
  db:system_keyspace: add `SERVICE_LEVELS_V2` table
  service:qos: extract common service levels' table functions
2024-03-22 11:51:53 +01:00
Kamil Braun
4359a1b460 Merge 'raft timeouts: better handling of lost quorum' from Petr Gusev
In this PR we add timeouts support to raft groups registry. We introduce
the `raft_server_with_timeouts` class, which wraps the `raft::server`
add exposes its interface with additional `raft_timeout` parameter. If
it's set, the wrapper cancels the `abort_source` after certain amount of
time. The value of the timeout can be specified either in the
`raft_timeout` parameter, or the default value can be set in `the
raft_server_with_timeouts` class constructor.

The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with
`group0-raft-op-timeout-in-ms` parameter.

The new api allows the client to decide whether to use timeouts or not.
In this PR we are reviewing all the group0 call sites and add
`raft_timeout` if that makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.

Fixes scylladb/scylladb#16604

Closes scylladb/scylladb#17590

* github.com:scylladb/scylladb:
  migration_manager: use raft_timeout{}
  storage_service::join_node_response_handler: use raft_timeout{}
  storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
  storage_service::set_tablet_balancing_enabled: use raft_timeout{}
  storage_service::move_tablet: use raft_timeout{}
  raft_check_and_repair_cdc_streams: use raft_timeout{}
  raft_timeout: test that node operations fail properly
  raft_rebuild: use raft_timeout{}
  do_cluster_cleanup: use raft_timeout{}
  raft_initialize_discovery_leader: use raft_timeout{}
  update_topology_with_local_metadata: use with_timeout{}
  raft_decommission: use raft_timeout{}
  raft_removenode: use raft_timeout{}
  join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
  raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
  raft_group0: make_raft_config_nonvoter: add abort_source parameter
  manager_client: server_add with start=false shouldn't call driver_connect
  scylla_cluster: add seeds parameter to the add_server and servers_add
  raft_server_with_timeouts: report the lost quorum
  join_node_request_handler: add raft_timeout{} for start_operation
  skip_mode: add platform_key
  auth: use raft_timeout{}
  raft_group0_client: add raft_timeout parameter
  raft_group_registry: add group0_with_timeouts
  utils: add composite_abort_source.hh
  error_injection: move api registration to set_server_init
  error_injection: add inject_parameter method
  error_injection: move injection_name string into injection_shared_data
  error_injection: pass injection parameters at startup
2024-03-22 10:45:33 +01:00
Michał Jadwiszczak
dab909b1d1 service:qos: store information about sl data migration
Save information whether service levels data was migrated to v2 table.
The information is stored in `system.scylla_local` table. It's
written with raft command and included in raft snapshot.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
8e242f5acd db:system_keyspace: add SERVICE_LEVELS_V2 table
The table has the same schema as `system_distributed.service_levels`.
However it's created entirely at once (unlike old table which creates
base table first and then it adds other columns) because `system` tables
are local to the node.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
990c5e7dd0 service:qos: extract common service levels' table functions
Getting a service level(s) will be done the same way in raft-based
service levels as it's done in standard service levels, so those
funtions are extracted to reused it.
2024-03-21 23:14:57 +01:00
Pavel Emelyanov
21a5911e60 Merge 'db/virtual_tables: make token_ring_table tablet aware' from Botond Dénes
The token ring table is a virtual table (`system.token_ring`), which contains the ring information for all keyspaces in the system. This is essentially an alternative to `nodetool describering`, but since it is a virtual table, it allows for all the usual filtering/aggregation/etc. that CQL supports.
Up until now, this table only supported keyspaces which use vnodes. This PR adds support for tablet keyspaces. To accommodate these keyspaces a new `table_name` column is added, which is set to `ALL` for vnodes keyspaces. For tablet keyspaces, this contains the name of the table.
Simple sanity tests are added for this virtual table (it had none).

Fixes: #16850

Closes scylladb/scylladb#17351

* github.com:scylladb/scylladb:
  test/cql-pytest: test_virtual_tables: add test for token_ring table
  db/virtual_tables: token_ring_table: add tablet support
  db/virtual_tables: token_ring_table: add table_name column
  db/virtual_tables: token_ring_table: extract ring emit
  service/storage_service: describe_ring_for_table(): use topology to map hostid to ip
2024-03-20 14:05:49 +03:00
Petr Gusev
49a4220fea error_injection: pass injection parameters at startup
Injection parameters can be used in the lambda passed to
inject_with_handler method to take some values from
the test. However, there was no way to set values to these
parameters on node startup, only through
the error injection REST api. Therefore, we couldn't rely
on this when inject_with_handler is used during
node startup, it could trigger before we call the api
from the test.

In this commit with solve this problem by allowing these
parameters to be assigned through scylla.yaml config.

The defer.hh header was added to error_injection.hh to fix
compilation after adding error_injection.hh to config.hh,
defer function is used in error_injection.hh.
2024-03-19 20:17:02 +04:00
Avi Kivity
e48eb76f61 sstables_manager: decouple from system_keyspace
sstables_manager now depends on system_keyspace for access to the
system.sstables table, needed by object storage. This violates
modularity, since sstables_manager is a relatively low-level leaf
module while system_keyspace integrates large parts of the system
(including, indirectly, sstables_manager).

One area where this is grating is sstables::test_env, which has
to include the much higher level cql_test_env to accommodate it.

Fix this by having sstables_manager expose its dependency on
system_keyspace as an interface, sstables_registry, and have
system_keyspace implement the glue logic in
system_keyspace_sstables_manager.

Closes scylladb/scylladb#17868
2024-03-18 20:38:07 +03:00
Avi Kivity
72bbe75d5b Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec
This PR fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because tablet replica allocation for
rebuild will not be able to find a viable destination, as the replacing node
is not considered to be a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.

The solution taken here is to not drain tablet replicas from replaced
node during topology request but leave it to happen later after the
replaced node is in left state and replacing node is in normal state.

The replacing node waits for this draining to be complete on boot
before the node is considered booted.

Fixes https://github.com/scylladb/scylladb/issues/17025

Nodes in the left state will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:

 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first.

 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement.

It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.

Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).

In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet replica sets.

Currently left nodes are never removed from topology, so will
accumulate in memory. We could garbage-collect them from topology
coordinator if a left node is absent in any replica set. That means we
need a new state - left_for_real.

Closes scylladb/scylladb#17388

* github.com:scylladb/scylladb:
  test: py: Add test for view replica pairing after replace
  raft, api: Add RESTful API to query current leader of a raft group
  test: test_tablets_removenode: Verify replacing when there is no spare node
  doc: topology-on-raft: Document replace behavior with tablets
  tablets, raft topology: Rebuild tablets after replacing node is normal
  tablets: load_balancer: Access node attributes via node struct
  tablets: load_balancer: Extract ensure_node()
  mv: Switch to using host_id-based replica set
  effective_replication_map: Introduce host_id-based get_replicas()
  raft topology: Keep nodes in the left state to topology
  tablets: Introduce read_required_hosts()
2024-03-18 16:16:08 +02:00
Wojciech Mitros
efcb718e0a mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

Closes scylladb/scylladb#17855
2024-03-18 14:31:54 +02:00
Avi Kivity
731b5c5120 schema_tables: unfreeze frozen_mutation:s gently
With large schemas, unfreezing can stall, especially as it requires
a lot of memory. Switch to a gentle version that will not stall.
2024-03-17 17:46:02 +02:00
Kefu Chai
8bab51733f db: add fmt::formatter for db::functions::function
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `db::functions::function`.
please note, because we use `std::ostream` as the parameter of
the polymorphism implementation of `function::print()`.
without an intrusive change, we have to use `fmt::ostream_formatter`
or at least use similar technique to format the `function` instance
into an instance of `ostream` first. so instead of implementing
a "native" `fmt::formatter`, in this change, we just use
`fmt::ostream_formatter`.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17832
2024-03-16 17:36:49 +02:00
Tomasz Grabiec
9b656ec2aa mv: Switch to using host_id-based replica set
This is necessary to not break replica pairing between base and
view. After replacing a node, tablet replica set contains for a while
the replaced node which is in the left state. This node is not
returned by the IP-based get_natural_endpoints() so the replica
indexes would shift, changing the pairing with the view.

The host_id-based replica set always has stable indexes for replicas.
2024-03-15 11:05:29 +01:00
Tomasz Grabiec
61b3453552 raft topology: Keep nodes in the left state to topology
Those nodes will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:

 1) algorithms which work with replica sets filter nodes based on
 their location. For example materialized views code which pairs base
 replicas with view replicas filters by datacenter first.

 2) tablet scheduler needs to identify each node's location in order
 to make decisions about new replica placement.

It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.

Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).

In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.

We load topology infromation only for left nodes which are actually
referenced by any tablet. To achieve that, topology loading code
queries system.tablet for the set of hosts. This set is then passed to
system.topology loading method which decides whether to load
replica_state for a left node or not.
2024-03-15 11:05:29 +01:00
Botond Dénes
279e496133 db/virtual_tables: token_ring_table: add tablet support
For keyspaces which use tablets, we describe each table separately.
2024-03-15 04:23:20 -04:00
Botond Dénes
61b6ac7ffe db/virtual_tables: token_ring_table: add table_name column
As the first clustering column. For vnode keyspaces, this will always be
"ALL", for tablet keyspaces, this will contain the name of the described
table.
2024-03-15 04:23:20 -04:00
Botond Dénes
fdef62c232 db/virtual_tables: token_ring_table: extract ring emit
Into a separate method. For vnodes there is a single ring per keyspace,
but for tablets, there is a separate ring for each table in the
keyspace. To accomodate both, we move the code emitting the ring into a
separate method, so execute() can just call it once per keyspace or once
per table, whichever appropriate.
2024-03-15 04:23:20 -04:00
Tomasz Grabiec
8c5d088928 Merge 'Drop tablets of dropped views and indices' from Benny Halevy
This series adds notification before dropping views and indices so that the
tablet_allocator can generate mutations to respectively drop all tablets associated with them from system.tablets.

Additional unit tests were added for these cases.

Note that one case is not yet tested: where a table is allowed to be dropped while having views that depend on it, when it is dropped from the alternator path.

This is tested indirectly by testing dropping a table with live secondary index as it follows the same notification path as views in this series.

Fixes #17627

Closes scylladb/scylladb#17773

* github.com:scylladb/scylladb:
  migration_manager: notify before_drop_column_family when dropping indices
  schema_tables: make_update_indices_mutations: use find_schema to lookup the view of dropped indices
  migration_manager: notify before_drop_column_family before dropping views
  cql-pytest: test_tablets: add test_tablets_are_dropped_when_dropping_table
  tablet_allocator: on_before_drop_column_family: remove unused result variable
2024-03-14 22:52:29 +01:00
Benny Halevy
5bfca73b30 migration_manager: notify before_drop_column_family when dropping indices
Fixes #17627

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-14 20:19:12 +02:00
Benny Halevy
9cf6a2e510 schema_tables: make_update_indices_mutations: use find_schema to lookup the view of dropped indices
When dropping indices, we don't need to go through
`create_view_for_index` in order to drop the index.
That actually creates a new schema for this view
which is used just for its metadata for generating mutations
dropping it.

Instead, use `find_schema` to lookup the current schema
for the dropped index.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-14 20:19:11 +02:00
Piotr Smaron
ad2d039e3d db: move all group 0 tables to schema commitlog
This is to have durability for the group0 tables.
But also because I need it specifially to make
`system.topology` & `system_schema.scylla_keyspaces`
mutations under a single raft command in https://github.com/scylladb/scylladb/pull/16723

Fixes: #15596

Closes scylladb/scylladb#17783
2024-03-14 13:33:30 +01:00
Kefu Chai
926fe29ebd db: commitlog: add fmt::formatter for commitlog types
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for

* db::commitlog::segment::cf_mark
* db::commitlog::segment_manager::named_file
* db::commitlog::segment_manager::dispose_mode
* db::commitlog::segment_manager::byte_flow<T>

please note, the formatter of `db::commitlog::segment` is not
included in this commit, as we are formatting it in the inline
definition of this class. so we cannot define the specialization
of `fmt::formatter` for this class before its callers -- we'd
either use `format_as()` provided by {fmt} v10, or use `fmt::streamed`.
either way, it's different from the theme of this commit, and we
will handle it in a separated commit.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17792
2024-03-14 09:28:12 +02:00
Pavel Emelyanov
3a734facc7 view_builder: Complete build step early if reader produces nothing
Builder works in "steps". Each step runs for a given base table, when a
new view is created it either initiates a step or appends to currently
running step.

Running a step means reading mutations from local sstables reader and
applying them to all views that has jumped into this step so far. When a
view is added to the step it remembers the current token value the step
is on. When step receives end-of-stream it rewinds to minimal-token.
Rewinding is done by closing current reader and creating a new one. Each
time token is advanced, all the views that meet the new token value for
the second time (i.e. -- scan full round) are marked as built and are
removed from step. When no views are left on step, it finishes.

The above machinery can break when rewinding the end-of-stream reader.
The trick is that a running step silently assumes that if the reader
once produced some token (and there can be a view that remembered this
token as its starting one), then after rewinding the reader would
generate the same token or greater. With tablets, however, that's not
the case. When a node is decommissioned tablets are cleaned and all
sstables are removed. Rewinding a reader after it makes empty reader
that produces no tokens from now on. Respectively, any build steps that
had captured tokens prior to cleanup would get stuck forever.

The fix is to check if the mutation consumer stepped at least one step
forward after rewind, and if no -- complete all the attached views.

fixes: #17293

Similar thing should happen if the base table is truncated with views
being built from it. Testing it steps on compaction assertion elsewhere
and needs more research.

refs: #17543

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17548
2024-03-12 14:58:47 +02:00
Nadav Har'El
19bcea6216 materialized views: fix rare failure caused by empty update
This one-line patch fixes a failure in the dtest

        lwt_schema_modification_test.py::TestLWTSchemaModification
        ::test_table_alter_delete

Where an update sometimes failed due to an internal server error, and the
log had the mysterious warning message:

        "std::logic_error (Empty materialized view updated)"

We've also seen this log-message in the past in another user's log, and
never understood what it meant.

It turns out that the error message was generated (and warning printed)
while building view updates for a base-table mutation, and noticing that
the base mutation contains an *empty* row - a row with no cells or
tombstone or anything whatsoever. This case was deemed (8 years ago,
in d5a61a8c48) unexpected and nonsensical,
and we threw an exception. But this case actually *can* happen - here is
how it happened in test_table_alter_delete - which is a test involving
a strange combination of materialized views, LWT and schema changes:

 1. A table has a materialized view, and also a regular column "int_col".
 2. A background thread repeatedly drops and re-creates this column
    int_col.
 3. Another thread deletes rows with LWT ("IF EXISTS").
 4. These LWT operations each reads the existing row, and because of
    repeated drop-and-recreate of the "int_col" column, sometimes this
    read notices that one node has a value for int_col and the other
    doesn't, and creates a read-repair mutation setting int_col (the
    difference between the two reads includes just in this column).
 5. The node missing "int_col" receives this mutation which sets only
    int_col. It upgrade()s this mutation to its most recent schema,
    which doesn't have int_col, so it removes this column from the
    mutation row - and is left with a completely empty mutation row.
    This completely empty row is not useful, but upgrade() doesn't
    remove it.
 6. The view-update generation code sees this empty base-mutation row
    and fails it with this std::logic_error.
 7. The node which sent the read-repair mutation sees that the read
    repair failed, so it fails the read and therefore fails the LWT
    delete operation.
    It is this LWT operation which failed in the test, and caused
    the whole test to fail.

The fix is trivial: an empty base-table row mutation should simply be
*ignored* when generating view updates - it shouldn't cause any error.

Before this patch, test_table_alter_delete used to fail in roughly
20% of the runs on my laptop. After this patch, I ran it 100 times
without a single failure.

Fixes #15228
Fixes #17549

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17607
2024-03-07 12:00:43 +02:00
Kamil Braun
19b816bb68 Merge 'Migrate system_auth to raft group0' from Marcin Maliszkiewicz
This patch series makes all auth writes serialized via raft. Reads stay
eventually consistent for performance reasons. To make transition to new
code easier data is stored in a newly created keyspace: system_auth_v2.

Internally the difference is that instead of executing CQL directly for
writes we generate mutations and then announce them via raft group0. Per
commit descriptions provide more implementation details.

Refs https://github.com/scylladb/scylladb/issues/16970
Fixes https://github.com/scylladb/scylladb/issues/11157

Closes scylladb/scylladb#16578

* github.com:scylladb/scylladb:
  test: extend auth-v2 migration test to catch stale static
  test: add auth-v2 migration test
  test: add auth-v2 snapshot transfer test
  test: auth: add tests for lost quorum and command splitting
  test: pylib: disconnect driver before re-connection
  test: adjust tests for auth-v2
  auth: implement auth-v2 migration
  auth: remove static from queries on auth-v2 path
  auth: coroutinize functions in password_authenticator
  auth: coroutinize functions in standard_role_manager
  auth: coroutinize functions in default_authorizer
  storage_service: add support for auth-v2 raft snapshots
  storage_service: extract getting mutations in raft snapshot to a common function
  auth: service: capture string_view by value
  alternator: add support for auth-v2
  auth: add auth-v2 write paths
  auth: add raft_group0_client as dependency
  cql3: auth: add a way to create mutations without executing
  cql3: run auth DML writes on shard 0 and with raft guard
  service: don't loose service_level_controller when bouncing client_state
  auth: put system_auth and users consts in legacy namespace
  cql3: parametrize keyspace name in auth related statements
  auth: parametrize keyspace name in roles metadata helpers
  auth: parametrize keyspace name in password_authenticator
  auth: parametrize keyspace name in standard_role_manager
  auth: remove redundant consts auth::meta::*::qualified_name
  auth: parametrize keyspace name in default_authorizer
  db: make all system_auth_v2 tables use schema commitlog
  db: add system_auth_v2 tables
  db: add system_auth_v2 keyspace
2024-03-06 10:11:33 +01:00
Dawid Medrek
b36becc1f3 db/hints: Fix too_many_in_flight_hints_for
The semantics of the function was accidentally
modified in 6e79d64. The consequence of the change
was that we didn't limit memory consumption:
the function always returned false for any node
different from the local node. The returned value
is used by storage_proxy to decide whether it
is able to store a hint or not.

This commit fixes the problem by taking other
nodes into consideration again.

Fixes #17636

Closes scylladb/scylladb#17639
2024-03-06 09:48:30 +02:00
Marcin Maliszkiewicz
ebb0ffeb6c auth: implement auth-v2 migration
During raft topology upgrade procedure data from
system_auth keyspace will be migrated to system_auth_v2.

Migration works mostly on top of CQL layer to minimize
amount of new code introduced, it mostly executes SELECTs
on old tables and then INSERTs on new tables. Writes are
not executed as usual but rather announced via raft.
2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz
5a6d4dbc37 storage_service: add support for auth-v2 raft snapshots
This patch adds new RPC for pulling snapshot of auth tables.
2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz
d3679de1d2 db: make all system_auth_v2 tables use schema commitlog 2024-03-01 10:40:29 +01:00
Marcin Maliszkiewicz
a706424825 db: add system_auth_v2 tables
Their schema is equivalent to legacy tables
in system_auth.
2024-03-01 10:40:29 +01:00
Marcin Maliszkiewicz
9144d8203b db: add system_auth_v2 keyspace
New keyspace is added similarly as system_schema keyspace,
it's being registred via system_keyspace::make which calls
all_tables to build its schema.

Dummy table 'roles' is added as keyspaces are being currently
registered by walking through their tables. Full table schemas
will be added in subsequent commits.

Change can be observed via cqlsh:

cassandra@cqlsh> describe keyspaces;

system_auth_v2  system_schema       system         system_distributed_everywhere
system_auth     system_distributed  system_traces

cassandra@cqlsh> describe keyspace system_auth_v2;

CREATE KEYSPACE system_auth_v2 WITH replication = {'class': 'LocalStrategy'}  AND durable_writes = true;

CREATE TABLE system_auth_v2.roles (
    role text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = 'comment'
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 604800
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
2024-03-01 10:40:29 +01:00
Botond Dénes
5e37c1465f db/config: introduce query_page_size_in_bytes
Regulates the page size in bytes via config, instead of the currently
used hard-coded constant. Allows tests to configure lower limits so they
can work with smaller data-sets when testing paging related
functionality.
Not wired yet.
2024-02-27 02:14:45 -05:00
Nadav Har'El
b0233c0833 Merge 'interval: rename nonwrapping_interval to interval' from Avi Kivity
Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token.

We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility.

Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping.

We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.

Closes scylladb/scylladb#17455

* github.com:scylladb/scylladb:
  interval: rename nonwrapping_interval to interval
  interval: rename interval_test to wrapping_interval_test
2024-02-22 14:03:43 +02:00
Kamil Braun
3d15fecf12 Merge 'amend cluster_status_table virtual table to work with raft' from Gleb
cluster_status_table virtual table have a status field for each node. In
gossiper mode the status is taken from the gossiper, but with raft the
states are different and are stored in the topology state machine. The
series fixes the code to check current mode and take the status from
correct place.

Refs scylladb/scylladb#16984

* 'gleb/cluster_status_table-v1' of github.com:scylladb/scylla-dev:
  gossiper: remove unused REMOVAL_COORDINATOR state
  virtual_tables: take node state from raft for cluster_status_table table if topology over raft is enabled
  virtual_tables: create result for  cluster_status_table read on shard 0
2024-02-22 11:47:57 +01:00
Kamil Braun
3ee56e1936 Merge 'raft topology: enable writes to previous CDC generations' from Patryk Jędrzejczak
When we create a CDC generation and ring-delay is non-zero, the
timestamp of the new generation is in the future. Hence, we can
have multiple generations that can be written to. However, if we
add a new node to the cluster with the Raft-based topology, it
receives only the last committed generation. So, this node will
be rejecting writes considered correct by the other nodes until
the last committed generation starts operating.

In scylladb/scylladb#17134, we have allowed sending writes to the
previous CDC generations. So, the situation became even more
complicated. This PR adjusts the Raft-based topology
to ensure all required generations are loaded into memory and their
data isn't cleared too early.

To load all required generations into memory, we replace
`current_cdc_generation_{uuid, timestamp}` with the set containing
IDs of all committed generations - `committed_cdc_generations`.
To ensure this set doesn't grow endlessly, we remove an entry from
this set together with the data in CDC_GENERATIONS_V3.

Currently, we may clear a CDC generation's data from
CDC_GENERATIONS_V3 if it is not the last committed generation
and it is at least 24 hours old (according to the topology
coordinator's clock). However, after allowing writes to the
previous CDC generations, this condition became incorrect. We
might clear data of a generation that could still be written to.
The new solution introduced in this PR is to clear data of the
generations that finished operating more than 24 hours ago.

Apart from the changes mentioned above, this PR hardens
`test_cdc_generation_clearing.py`.

Fixes scylladb/scylladb#16916
Fixes scylladb/scylladb#17184
Fixes scylladb/scylladb#17288

Closes scylladb/scylladb#17374

* github.com:scylladb/scylladb:
  test: harden test_cdc_generation_clearing
  test: test clean-up of committed_cdc_generations
  raft topology: clean committed_cdc_generations
  raft topology: clean only obsolete CDC generations' data
  storage_service: topology_state_load: load all committed CDC generations
  system_keyspace: load_topology_state: fix indentation
  raft topology: store committed CDC generations' IDs in the topology
2024-02-22 11:41:25 +01:00
Avi Kivity
51df8b9173 interval: rename nonwrapping_interval to interval
Our interval template started life as `range`, and was supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.

We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.

Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.

We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval_designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
2024-02-21 19:43:17 +02:00
Tomasz Grabiec
ef9e5e64a3 locator: token_metadata: Introduce topology barrier stall detector
When topology barrier is blocked for longer than configured threshold
(2s), stale versions are marked as stalled and when they get released
they report backtrace to the logs. This should help to identify what
was holding for token metadata pointer for too long.

Example log:

  token_metadata - topology version 30 held for 299.159 [s] past expiry, released at:  0x2397ae1 0x23a36b6 ...

Closes scylladb/scylladb#17427
2024-02-21 15:05:34 +02:00
Avi Kivity
605bf6e221 range.hh: retire
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.

Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.

The unit tests are renamed and range.hh is deleted.

Closes scylladb/scylladb#17428
2024-02-21 00:24:25 +02:00
Avi Kivity
93af3dd69b Merge 'Maintenance socket: set filesystem permissions to 660' from Mikołaj Grzebieluch
Set filesystem permissions for the maintenance socket to 660 (previously it was 755) to allow a scyllaadm's group to connect.
Split the logic of creating sockets into two separate functions, one for each case: when it is a regular cql controller or used by maintenance_socket.

Fixes https://github.com/scylladb/scylladb/issues/16487.

Closes scylladb/scylladb#17113

* github.com:scylladb/scylladb:
  maintenance_socket: add option to set owning group
  transport/controller: get rid of magic number for socket path's maximal length
  transport/controller: set unix_domain_socket_permissions for maintenance_socket
  transport/controller: pass unix_domain_socket_permissions to generic_server::listen
  transport/controller: split configuring sockets into separate functions
2024-02-20 15:09:54 +02:00
Patryk Jędrzejczak
b8aa74f539 raft topology: clean only obsolete CDC generations' data
Currently, we may clear a CDC generation's data from
CDC_GENERATIONS_V3 if it is not the last committed generation
and it is at least 24 hours old (according to the topology
coordinator's clock). However, after allowing writes to the
previous CDC generations, this condition became incorrect. We
might clear data of a generation that could still be written to.

The new solution is to clear data of the generations that
finished operating more than 24 hours ago. The rationale behind
it is in the new comment in
`topology_coordinator:clean_obsolete_cdc_generations`.

The previous solution used the clean-up candidate. After
introducing `committed_cdc_generations`, it became unneeded.
The last obsolete generation can be computed in
`topology_coordinator:clean_obsolete_cdc_generations`. Therefore,
we remove all the code that handles the clean-up candidate.

After changing how we clear CDC generations' data,
`test_current_cdc_generation_is_not_removed` became obsolete.
The tested feature is not present in the code anymore.

`test_dependency_on_timestamps` became the only test case covering
the CDC generation's data clearing. We adjust it after the changes.
2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak
18cff1aa6a system_keyspace: load_topology_state: fix indentation
Broken in the previous patch.
2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak
e145e758eb raft topology: store committed CDC generations' IDs in the topology
When we create a CDC generation and ring-delay is non-zero, the
timestamp of the new generation is in the future. Hence, we can
have multiple generations that can be written to. However, if we
add a new node to the cluster with the Raft-based topology, it
receives only the last committed generation. So, this node will
be rejecting writes considered correct by the other nodes until
the last committed generation starts operating.

In scylladb/scylladb#17134, we have allowed sending writes to the
previous CDC generations. So, the situation became even more
complicated. We need to adjust the Raft-based topology to ensure
all required generations are loaded into memory and their data
isn't cleared too early.

This patch is the first step of the adjustment. We replace
`current_cdc_generation_{uuid, timestamp}` with the set containing
IDs of all committed generations - `committed_cdc_generations`.
This set is sorted by timestamps, just like
`unpublished_cdc_generations`.

This patch is mostly refactoring. The last generation in
`committed_cdc_generations` is the equivalent of the previous
`current_cdc_generation_{uuid, timestamp}`. The other generations
are irrelevant for now. They will be used in the following patches.

After introducing `committed_cdc_generations`, a newly committed
generation is also unpublished (it was current and unpublished
before the patch). We introduce `add_new_committed_cdc_generation`,
which updates both sets of generations so that we don't have to
call `add_committed_cdc_generation` and
`add_unpublished_cdc_generation` together. It's easy to forget
that both of them are necessary. Before this patch, there was
no call to `add_unpublished_cdc_generation` in
`topology_coordinator::build_coordinator_state`. It was a bug
reported in scylladb/scylladb#17288. This patch fixes it.

This patch also removes "the current generation" notion from the
Raft-based topology. For the Raft-based topology, the current
generation was the last committed generation. However, for the
`cdc::metadata`, it was the generation operating now. These two
generations could be different, which was confusing. For the
`cdc::metadata`, the current generation is relevant as it is
handled differently, but for the Raft-based topology, it isn't.
Therefore, we change only the Raft-based topology. The generation
called "current" is called "the last committed" from now.
2024-02-20 12:35:16 +01:00
Gleb Natapov
461bba08cb virtual_tables: take node state from raft for cluster_status_table table if topology over raft is enabled
If topology over raft is enabled the most up-to-date node status is in
the topology state machine. Get it from there.
2024-02-19 15:01:33 +02:00
Gleb Natapov
eb6fa81714 virtual_tables: create result for cluster_status_table read on shard 0
Next patch will access data that is available only on shard 0 during
result creation.
2024-02-19 15:01:33 +02:00
Mikołaj Grzebieluch
182cfebe40 maintenance_socket: add option to set owning group
Option `maintenance-socket-group` sets the owning group of the maintenance socket.
If not set, the group will be the same as the user running the scylla node.
2024-02-19 10:21:00 +01:00
Kefu Chai
50964c423e hints: host_filter: add formatter for hints::host_filter
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `hints::host_filter`. its
operator<< is preserved as it's still used by the homebrew generic
formatter for vector<>, which is in turn used by db/config.cc.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17347
2024-02-16 19:03:11 +03:00