scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 22:43:15 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	21a5911e60	Merge 'db/virtual_tables: make token_ring_table tablet aware' from Botond Dénes The token ring table is a virtual table (`system.token_ring`), which contains the ring information for all keyspaces in the system. This is essentially an alternative to `nodetool describering`, but since it is a virtual table, it allows for all the usual filtering/aggregation/etc. that CQL supports. Up until now, this table only supported keyspaces which use vnodes. This PR adds support for tablet keyspaces. To accommodate these keyspaces a new `table_name` column is added, which is set to `ALL` for vnodes keyspaces. For tablet keyspaces, this contains the name of the table. Simple sanity tests are added for this virtual table (it had none). Fixes: #16850 Closes scylladb/scylladb#17351 * github.com:scylladb/scylladb: test/cql-pytest: test_virtual_tables: add test for token_ring table db/virtual_tables: token_ring_table: add tablet support db/virtual_tables: token_ring_table: add table_name column db/virtual_tables: token_ring_table: extract ring emit service/storage_service: describe_ring_for_table(): use topology to map hostid to ip	2024-03-20 14:05:49 +03:00
Avi Kivity	e48eb76f61	sstables_manager: decouple from system_keyspace sstables_manager now depends on system_keyspace for access to the system.sstables table, needed by object storage. This violates modularity, since sstables_manager is a relatively low-level leaf module while system_keyspace integrates large parts of the system (including, indirectly, sstables_manager). One area where this is grating is sstables::test_env, which has to include the much higher level cql_test_env to accommodate it. Fix this by having sstables_manager expose its dependency on system_keyspace as an interface, sstables_registry, and have system_keyspace implement the glue logic in system_keyspace_sstables_manager. Closes scylladb/scylladb#17868	2024-03-18 20:38:07 +03:00
Avi Kivity	72bbe75d5b	Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes https://github.com/scylladb/scylladb/issues/17025 Nodes in the left state will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those node's location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacanter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. Currently left nodes are never removed from topology, so will accumulate in memory. We could garbage-collect them from topology coordinator if a left node is absent in any replica set. That means we need a new state - left_for_real. Closes scylladb/scylladb#17388 * github.com:scylladb/scylladb: test: py: Add test for view replica pairing after replace raft, api: Add RESTful API to query current leader of a raft group test: test_tablets_removenode: Verify replacing when there is no spare node doc: topology-on-raft: Document replace behavior with tablets tablets, raft topology: Rebuild tablets after replacing node is normal tablets: load_balancer: Access node attributes via node struct tablets: load_balancer: Extract ensure_node() mv: Switch to using host_id-based replica set effective_replication_map: Introduce host_id-based get_replicas() raft topology: Keep nodes in the left state to topology tablets: Introduce read_required_hosts()	2024-03-18 16:16:08 +02:00
Wojciech Mitros	efcb718e0a	mv: adjust memory tracking of single view updates within a batch Currently, when dividing memory tracked for a batch of updates we do not take into account the overhead that we have for processing every update. This patch adds the overhead for single updates and joins the memory calculation path for batches and their parts so that both use the same overhead. Fixes #17854 Closes scylladb/scylladb#17855	2024-03-18 14:31:54 +02:00
Avi Kivity	731b5c5120	schema_tables: unfreeze frozen_mutation:s gently With large schemas, unfreezing can stall, especially as it requires a lot of memory. Switch to a gentle version that will not stall.	2024-03-17 17:46:02 +02:00
Kefu Chai	8bab51733f	db: add fmt::formatter for db::functions::function before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `db::functions::function`. please note, because we use `std::ostream` as the parameter of the polymorphism implementation of `function::print()`. without an intrusive change, we have to use `fmt::ostream_formatter` or at least use similar technique to format the `function` instance into an instance of `ostream` first. so instead of implementing a "native" `fmt::formatter`, in this change, we just use `fmt::ostream_formatter`. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17832	2024-03-16 17:36:49 +02:00
Tomasz Grabiec	9b656ec2aa	mv: Switch to using host_id-based replica set This is necessary to not break replica pairing between base and view. After replacing a node, tablet replica set contains for a while the replaced node which is in the left state. This node is not returned by the IP-based get_natural_endpoints() so the replica indexes would shift, changing the pairing with the view. The host_id-based replica set always has stable indexes for replicas.	2024-03-15 11:05:29 +01:00
Tomasz Grabiec	61b3453552	raft topology: Keep nodes in the left state to topology Those nodes will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those node's location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacanter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. We load topology infromation only for left nodes which are actually referenced by any tablet. To achieve that, topology loading code queries system.tablet for the set of hosts. This set is then passed to system.topology loading method which decides whether to load replica_state for a left node or not.	2024-03-15 11:05:29 +01:00
Botond Dénes	279e496133	db/virtual_tables: token_ring_table: add tablet support For keyspaces which use tablets, we describe each table separately.	2024-03-15 04:23:20 -04:00
Botond Dénes	61b6ac7ffe	db/virtual_tables: token_ring_table: add table_name column As the first clustering column. For vnode keyspaces, this will always be "ALL", for tablet keyspaces, this will contain the name of the described table.	2024-03-15 04:23:20 -04:00
Botond Dénes	fdef62c232	db/virtual_tables: token_ring_table: extract ring emit Into a separate method. For vnodes there is a single ring per keyspace, but for tablets, there is a separate ring for each table in the keyspace. To accomodate both, we move the code emitting the ring into a separate method, so execute() can just call it once per keyspace or once per table, whichever appropriate.	2024-03-15 04:23:20 -04:00
Tomasz Grabiec	8c5d088928	Merge 'Drop tablets of dropped views and indices' from Benny Halevy This series adds notification before dropping views and indices so that the tablet_allocator can generate mutations to respectively drop all tablets associated with them from system.tablets. Additional unit tests were added for these cases. Note that one case is not yet tested: where a table is allowed to be dropped while having views that depend on it, when it is dropped from the alternator path. This is tested indirectly by testing dropping a table with live secondary index as it follows the same notification path as views in this series. Fixes #17627 Closes scylladb/scylladb#17773 * github.com:scylladb/scylladb: migration_manager: notify before_drop_column_family when dropping indices schema_tables: make_update_indices_mutations: use find_schema to lookup the view of dropped indices migration_manager: notify before_drop_column_family before dropping views cql-pytest: test_tablets: add test_tablets_are_dropped_when_dropping_table tablet_allocator: on_before_drop_column_family: remove unused result variable	2024-03-14 22:52:29 +01:00
Benny Halevy	5bfca73b30	migration_manager: notify before_drop_column_family when dropping indices Fixes #17627 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-14 20:19:12 +02:00
Benny Halevy	9cf6a2e510	schema_tables: make_update_indices_mutations: use find_schema to lookup the view of dropped indices When dropping indices, we don't need to go through `create_view_for_index` in order to drop the index. That actually creates a new schema for this view which is used just for its metadata for generating mutations dropping it. Instead, use `find_schema` to lookup the current schema for the dropped index. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-14 20:19:11 +02:00
Piotr Smaron	ad2d039e3d	db: move all group 0 tables to schema commitlog This is to have durability for the group0 tables. But also because I need it specifially to make `system.topology` & `system_schema.scylla_keyspaces` mutations under a single raft command in https://github.com/scylladb/scylladb/pull/16723 Fixes: #15596 Closes scylladb/scylladb#17783	2024-03-14 13:33:30 +01:00
Kefu Chai	926fe29ebd	db: commitlog: add fmt::formatter for commitlog types before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * db::commitlog::segment::cf_mark * db::commitlog::segment_manager::named_file * db::commitlog::segment_manager::dispose_mode * db::commitlog::segment_manager::byte_flow<T> please note, the formatter of `db::commitlog::segment` is not included in this commit, as we are formatting it in the inline definition of this class. so we cannot define the specialization of `fmt::formatter` for this class before its callers -- we'd either use `format_as()` provided by {fmt} v10, or use `fmt::streamed`. either way, it's different from the theme of this commit, and we will handle it in a separated commit. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17792	2024-03-14 09:28:12 +02:00
Pavel Emelyanov	3a734facc7	view_builder: Complete build step early if reader produces nothing Builder works in "steps". Each step runs for a given base table, when a new view is created it either initiates a step or appends to currently running step. Running a step means reading mutations from local sstables reader and applying them to all views that has jumped into this step so far. When a view is added to the step it remembers the current token value the step is on. When step receives end-of-stream it rewinds to minimal-token. Rewinding is done by closing current reader and creating a new one. Each time token is advanced, all the views that meet the new token value for the second time (i.e. -- scan full round) are marked as built and are removed from step. When no views are left on step, it finishes. The above machinery can break when rewinding the end-of-stream reader. The trick is that a running step silently assumes that if the reader once produced some token (and there can be a view that remembered this token as its starting one), then after rewinding the reader would generate the same token or greater. With tablets, however, that's not the case. When a node is decommissioned tablets are cleaned and all sstables are removed. Rewinding a reader after it makes empty reader that produces no tokens from now on. Respectively, any build steps that had captured tokens prior to cleanup would get stuck forever. The fix is to check if the mutation consumer stepped at least one step forward after rewind, and if no -- complete all the attached views. fixes: #17293 Similar thing should happen if the base table is truncated with views being built from it. Testing it steps on compaction assertion elsewhere and needs more research. refs: #17543 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17548	2024-03-12 14:58:47 +02:00
Nadav Har'El	19bcea6216	materialized views: fix rare failure caused by empty update This one-line patch fixes a failure in the dtest lwt_schema_modification_test.py::TestLWTSchemaModification ::test_table_alter_delete Where an update sometimes failed due to an internal server error, and the log had the mysterious warning message: "std::logic_error (Empty materialized view updated)" We've also seen this log-message in the past in another user's log, and never understood what it meant. It turns out that the error message was generated (and warning printed) while building view updates for a base-table mutation, and noticing that the base mutation contains an empty row - a row with no cells or tombstone or anything whatsoever. This case was deemed (8 years ago, in `d5a61a8c48`) unexpected and nonsensical, and we threw an exception. But this case actually can happen - here is how it happened in test_table_alter_delete - which is a test involving a strange combination of materialized views, LWT and schema changes: 1. A table has a materialized view, and also a regular column "int_col". 2. A background thread repeatedly drops and re-creates this column int_col. 3. Another thread deletes rows with LWT ("IF EXISTS"). 4. These LWT operations each reads the existing row, and because of repeated drop-and-recreate of the "int_col" column, sometimes this read notices that one node has a value for int_col and the other doesn't, and creates a read-repair mutation setting int_col (the difference between the two reads includes just in this column). 5. The node missing "int_col" receives this mutation which sets only int_col. It upgrade()s this mutation to its most recent schema, which doesn't have int_col, so it removes this column from the mutation row - and is left with a completely empty mutation row. This completely empty row is not useful, but upgrade() doesn't remove it. 6. The view-update generation code sees this empty base-mutation row and fails it with this std::logic_error. 7. The node which sent the read-repair mutation sees that the read repair failed, so it fails the read and therefore fails the LWT delete operation. It is this LWT operation which failed in the test, and caused the whole test to fail. The fix is trivial: an empty base-table row mutation should simply be ignored when generating view updates - it shouldn't cause any error. Before this patch, test_table_alter_delete used to fail in roughly 20% of the runs on my laptop. After this patch, I ran it 100 times without a single failure. Fixes #15228 Fixes #17549 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17607	2024-03-07 12:00:43 +02:00
Kamil Braun	19b816bb68	Merge 'Migrate system_auth to raft group0' from Marcin Maliszkiewicz This patch series makes all auth writes serialized via raft. Reads stay eventually consistent for performance reasons. To make transition to new code easier data is stored in a newly created keyspace: system_auth_v2. Internally the difference is that instead of executing CQL directly for writes we generate mutations and then announce them via raft group0. Per commit descriptions provide more implementation details. Refs https://github.com/scylladb/scylladb/issues/16970 Fixes https://github.com/scylladb/scylladb/issues/11157 Closes scylladb/scylladb#16578 * github.com:scylladb/scylladb: test: extend auth-v2 migration test to catch stale static test: add auth-v2 migration test test: add auth-v2 snapshot transfer test test: auth: add tests for lost quorum and command splitting test: pylib: disconnect driver before re-connection test: adjust tests for auth-v2 auth: implement auth-v2 migration auth: remove static from queries on auth-v2 path auth: coroutinize functions in password_authenticator auth: coroutinize functions in standard_role_manager auth: coroutinize functions in default_authorizer storage_service: add support for auth-v2 raft snapshots storage_service: extract getting mutations in raft snapshot to a common function auth: service: capture string_view by value alternator: add support for auth-v2 auth: add auth-v2 write paths auth: add raft_group0_client as dependency cql3: auth: add a way to create mutations without executing cql3: run auth DML writes on shard 0 and with raft guard service: don't loose service_level_controller when bouncing client_state auth: put system_auth and users consts in legacy namespace cql3: parametrize keyspace name in auth related statements auth: parametrize keyspace name in roles metadata helpers auth: parametrize keyspace name in password_authenticator auth: parametrize keyspace name in standard_role_manager auth: remove redundant consts auth::meta::*::qualified_name auth: parametrize keyspace name in default_authorizer db: make all system_auth_v2 tables use schema commitlog db: add system_auth_v2 tables db: add system_auth_v2 keyspace	2024-03-06 10:11:33 +01:00
Dawid Medrek	b36becc1f3	db/hints: Fix too_many_in_flight_hints_for The semantics of the function was accidentally modified in `6e79d64`. The consequence of the change was that we didn't limit memory consumption: the function always returned false for any node different from the local node. The returned value is used by storage_proxy to decide whether it is able to store a hint or not. This commit fixes the problem by taking other nodes into consideration again. Fixes #17636 Closes scylladb/scylladb#17639	2024-03-06 09:48:30 +02:00
Marcin Maliszkiewicz	ebb0ffeb6c	auth: implement auth-v2 migration During raft topology upgrade procedure data from system_auth keyspace will be migrated to system_auth_v2. Migration works mostly on top of CQL layer to minimize amount of new code introduced, it mostly executes SELECTs on old tables and then INSERTs on new tables. Writes are not executed as usual but rather announced via raft.	2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz	5a6d4dbc37	storage_service: add support for auth-v2 raft snapshots This patch adds new RPC for pulling snapshot of auth tables.	2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz	d3679de1d2	db: make all system_auth_v2 tables use schema commitlog	2024-03-01 10:40:29 +01:00
Marcin Maliszkiewicz	a706424825	db: add system_auth_v2 tables Their schema is equivalent to legacy tables in system_auth.	2024-03-01 10:40:29 +01:00
Marcin Maliszkiewicz	9144d8203b	db: add system_auth_v2 keyspace New keyspace is added similarly as system_schema keyspace, it's being registred via system_keyspace::make which calls all_tables to build its schema. Dummy table 'roles' is added as keyspaces are being currently registered by walking through their tables. Full table schemas will be added in subsequent commits. Change can be observed via cqlsh: cassandra@cqlsh> describe keyspaces; system_auth_v2 system_schema system system_distributed_everywhere system_auth system_distributed system_traces cassandra@cqlsh> describe keyspace system_auth_v2; CREATE KEYSPACE system_auth_v2 WITH replication = {'class': 'LocalStrategy'} AND durable_writes = true; CREATE TABLE system_auth_v2.roles ( role text PRIMARY KEY ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = 'comment' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 604800 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE';	2024-03-01 10:40:29 +01:00
Botond Dénes	5e37c1465f	db/config: introduce query_page_size_in_bytes Regulates the page size in bytes via config, instead of the currently used hard-coded constant. Allows tests to configure lower limits so they can work with smaller data-sets when testing paging related functionality. Not wired yet.	2024-02-27 02:14:45 -05:00
Nadav Har'El	b0233c0833	Merge 'interval: rename nonwrapping_interval to interval' from Avi Kivity Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias. Closes scylladb/scylladb#17455 * github.com:scylladb/scylladb: interval: rename nonwrapping_interval to interval interval: rename interval_test to wrapping_interval_test	2024-02-22 14:03:43 +02:00
Kamil Braun	3d15fecf12	Merge 'amend cluster_status_table virtual table to work with raft' from Gleb cluster_status_table virtual table have a status field for each node. In gossiper mode the status is taken from the gossiper, but with raft the states are different and are stored in the topology state machine. The series fixes the code to check current mode and take the status from correct place. Refs scylladb/scylladb#16984 * 'gleb/cluster_status_table-v1' of github.com:scylladb/scylla-dev: gossiper: remove unused REMOVAL_COORDINATOR state virtual_tables: take node state from raft for cluster_status_table table if topology over raft is enabled virtual_tables: create result for cluster_status_table read on shard 0	2024-02-22 11:47:57 +01:00
Kamil Braun	3ee56e1936	Merge 'raft topology: enable writes to previous CDC generations' from Patryk Jędrzejczak When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. This PR adjusts the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. To load all required generations into memory, we replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. To ensure this set doesn't grow endlessly, we remove an entry from this set together with the data in CDC_GENERATIONS_V3. Currently, we may clear a CDC generation's data from CDC_GENERATIONS_V3 if it is not the last committed generation and it is at least 24 hours old (according to the topology coordinator's clock). However, after allowing writes to the previous CDC generations, this condition became incorrect. We might clear data of a generation that could still be written to. The new solution introduced in this PR is to clear data of the generations that finished operating more than 24 hours ago. Apart from the changes mentioned above, this PR hardens `test_cdc_generation_clearing.py`. Fixes scylladb/scylladb#16916 Fixes scylladb/scylladb#17184 Fixes scylladb/scylladb#17288 Closes scylladb/scylladb#17374 * github.com:scylladb/scylladb: test: harden test_cdc_generation_clearing test: test clean-up of committed_cdc_generations raft topology: clean committed_cdc_generations raft topology: clean only obsolete CDC generations' data storage_service: topology_state_load: load all committed CDC generations system_keyspace: load_topology_state: fix indentation raft topology: store committed CDC generations' IDs in the topology	2024-02-22 11:41:25 +01:00
Avi Kivity	51df8b9173	interval: rename nonwrapping_interval to interval Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.	2024-02-21 19:43:17 +02:00
Tomasz Grabiec	ef9e5e64a3	locator: token_metadata: Introduce topology barrier stall detector When topology barrier is blocked for longer than configured threshold (2s), stale versions are marked as stalled and when they get released they report backtrace to the logs. This should help to identify what was holding for token metadata pointer for too long. Example log: token_metadata - topology version 30 held for 299.159 [s] past expiry, released at: 0x2397ae1 0x23a36b6 ... Closes scylladb/scylladb#17427	2024-02-21 15:05:34 +02:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Avi Kivity	93af3dd69b	Merge 'Maintenance socket: set filesystem permissions to 660' from Mikołaj Grzebieluch Set filesystem permissions for the maintenance socket to 660 (previously it was 755) to allow a scyllaadm's group to connect. Split the logic of creating sockets into two separate functions, one for each case: when it is a regular cql controller or used by maintenance_socket. Fixes https://github.com/scylladb/scylladb/issues/16487. Closes scylladb/scylladb#17113 * github.com:scylladb/scylladb: maintenance_socket: add option to set owning group transport/controller: get rid of magic number for socket path's maximal length transport/controller: set unix_domain_socket_permissions for maintenance_socket transport/controller: pass unix_domain_socket_permissions to generic_server::listen transport/controller: split configuring sockets into separate functions	2024-02-20 15:09:54 +02:00
Patryk Jędrzejczak	b8aa74f539	raft topology: clean only obsolete CDC generations' data Currently, we may clear a CDC generation's data from CDC_GENERATIONS_V3 if it is not the last committed generation and it is at least 24 hours old (according to the topology coordinator's clock). However, after allowing writes to the previous CDC generations, this condition became incorrect. We might clear data of a generation that could still be written to. The new solution is to clear data of the generations that finished operating more than 24 hours ago. The rationale behind it is in the new comment in `topology_coordinator:clean_obsolete_cdc_generations`. The previous solution used the clean-up candidate. After introducing `committed_cdc_generations`, it became unneeded. The last obsolete generation can be computed in `topology_coordinator:clean_obsolete_cdc_generations`. Therefore, we remove all the code that handles the clean-up candidate. After changing how we clear CDC generations' data, `test_current_cdc_generation_is_not_removed` became obsolete. The tested feature is not present in the code anymore. `test_dependency_on_timestamps` became the only test case covering the CDC generation's data clearing. We adjust it after the changes.	2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak	18cff1aa6a	system_keyspace: load_topology_state: fix indentation Broken in the previous patch.	2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak	e145e758eb	raft topology: store committed CDC generations' IDs in the topology When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. We need to adjust the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. This patch is the first step of the adjustment. We replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. This set is sorted by timestamps, just like `unpublished_cdc_generations`. This patch is mostly refactoring. The last generation in `committed_cdc_generations` is the equivalent of the previous `current_cdc_generation_{uuid, timestamp}`. The other generations are irrelevant for now. They will be used in the following patches. After introducing `committed_cdc_generations`, a newly committed generation is also unpublished (it was current and unpublished before the patch). We introduce `add_new_committed_cdc_generation`, which updates both sets of generations so that we don't have to call `add_committed_cdc_generation` and `add_unpublished_cdc_generation` together. It's easy to forget that both of them are necessary. Before this patch, there was no call to `add_unpublished_cdc_generation` in `topology_coordinator::build_coordinator_state`. It was a bug reported in scylladb/scylladb#17288. This patch fixes it. This patch also removes "the current generation" notion from the Raft-based topology. For the Raft-based topology, the current generation was the last committed generation. However, for the `cdc::metadata`, it was the generation operating now. These two generations could be different, which was confusing. For the `cdc::metadata`, the current generation is relevant as it is handled differently, but for the Raft-based topology, it isn't. Therefore, we change only the Raft-based topology. The generation called "current" is called "the last committed" from now.	2024-02-20 12:35:16 +01:00
Gleb Natapov	461bba08cb	virtual_tables: take node state from raft for cluster_status_table table if topology over raft is enabled If topology over raft is enabled the most up-to-date node status is in the topology state machine. Get it from there.	2024-02-19 15:01:33 +02:00
Gleb Natapov	eb6fa81714	virtual_tables: create result for cluster_status_table read on shard 0 Next patch will access data that is available only on shard 0 during result creation.	2024-02-19 15:01:33 +02:00
Mikołaj Grzebieluch	182cfebe40	maintenance_socket: add option to set owning group Option `maintenance-socket-group` sets the owning group of the maintenance socket. If not set, the group will be the same as the user running the scylla node.	2024-02-19 10:21:00 +01:00
Kefu Chai	50964c423e	hints: host_filter: add formatter for hints::host_filter before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `hints::host_filter`. its operator<< is preserved as it's still used by the homebrew generic formatter for vector<>, which is in turn used by db/config.cc. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17347	2024-02-16 19:03:11 +03:00
Kamil Braun	50ebce8acc	Merge 'Purge old ip on change' from Petr Gusev When a node changes IP address we need to remove its old IP from `system.peers` and gossiper. We do this in `sync_raft_topology_nodes` when the new IP is saved into `system.peers` to avoid losing the mapping if the node crashes between deleting and saving the new IP. We also handle the possible duplicates in this case by dropping them on the read path when the node is restarted. The PR also fixes the problem with old IPs getting resurrected when a node changes its IP address. The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with restart, other nodes are not yet aware of `ip2` so they keep gossiping `ip1`. After restart `A` receives `ip1` in a gossip message and calls `handle_major_state_change` since it considers it as a new node. Then `on_join` event is called on the gossiper notification handlers, we receive such event in `raft_ip_address_updater` and reverts the IP of the node A back to ip1. To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup. The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes. Fixes #16886 Fixes #16691 Fixes #17199 Closes scylladb/scylladb#17162 * github.com:scylladb/scylladb: test_change_ip: improve the test raft_ip_address_updater: remove stale IPs from gossiper raft_address_map: add my ip with the new generation system_keyspace::update_peer_info: check ep and host_id are not empty system_keyspace::update_peer_info: make host_id an explicit parameter system_keyspace::update_peer_info: remove any_set flag optimisation system_keyspace: remove duplicate ips for host_id system_keyspace: peers table: use coroutines storage_service::raft_ip_address_updater: log gossiper event name raft topology: ip change: purge old IP on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes	2024-02-15 17:40:29 +01:00
Avi Kivity	5df5714331	Merge 'api: storage_service/natural_endpoints: add tablets support' from Botond Dénes This API endpoint currently returns with status 500 if attempted to be called for a table which uses tablets. This series adds tablet support. No change in usage semantics is required, the endpoint already has a table parameter. This endpoint is the backend of `nodetool getendpoints` which should now work, after this PR. Fixes: #17313 Closes scylladb/scylladb#17316 * github.com:scylladb/scylladb: service/storage_service: get_natural_endpoints(): add tablets support replica/database: keyspace: add uses_tablets() service/storage_service: remove token overload of get_natural_endpoints()	2024-02-15 13:36:56 +02:00
Petr Gusev	2bf75c1a4e	system_keyspace::update_peer_info: check ep and host_id are not empty	2024-02-15 13:21:04 +04:00
Petr Gusev	86410d71d1	system_keyspace::update_peer_info: make host_id an explicit parameter The host_id field should always be set, so it's more appropriate to pass it as a separate parameter. The function storage_service::get_peer_info_for_update is updated. It shouldn't look for host_id app state is the passed map, instead the callers should get the host_id on their own.	2024-02-15 13:21:04 +04:00
Petr Gusev	e0072f7cb3	system_keyspace::update_peer_info: remove any_set flag optimisation This optimization never worked -- there were four usages of the update_peer_info function and in all of them some of the peer_info fields were set or should be set: * sync_raft_topology_nodes/process_normal_node: e.g. tokens is set * sync_raft_topology_nodes/process_transition_node: host_id is set * handle_state_normal: tokens is set * storage_service::on_change: get_peer_info_for_update could potentially return a peer_info with all fields set to empty, but this shouldn't be possible, host_id should always be set. Moreover, there is a bug here: we extract host_id from the states_ parameter, which represent the gossiper application states that have been changed. This parameter contains host_id only if a node changes its IP address, in all other cases host_id is unset. This means we could end up with a record with empty host_id, if it wasn't previously set by some other means. We are going to fix this bug in the next commit.	2024-02-15 13:21:04 +04:00
Petr Gusev	4a14988735	system_keyspace: remove duplicate ips for host_id When a node changes IP we call sync_raft_topology_nodes from raft_ip_address_updater::on_endpoint_change with the old IP value in prev_ip parameter. It's possible that the nodes crashes right after we insert a new IP for the host_id, but before we remove the old IP. In this commit we fix the possible inconsistency by removing the system.peers record with old timestamp. This is what the new peers_table_read_fixup function is responsible for. We call this function in all system_keyspace methods that read the system.peers table. The function loads the table in memory, decides if some rows are stale by comparing their timestamps and removes them. The new function also removes the records with no host_id, so we no longer need the get_host_id function. We'll add a test for the problem this commit fixes in the next commit.	2024-02-15 13:21:04 +04:00
Petr Gusev	fa8718085a	system_keyspace: peers table: use coroutines This is a refactoring commit with no observable changes in behaviour. We switch the functions to coroutines, it'll be easy to work with them in this way in the next commit. Also, we add more const-s along the way.	2024-02-15 13:21:04 +04:00
Botond Dénes	7f17d3bb0e	replica/database: keyspace: add uses_tablets() Mirroring table::uses_tablets(), provides a convenient and -- more importabtly -- easily discoverable way to determine whether the keyspace uses tablets or not. This information is of course already available via the abstract replication strategy, but as seen in a few examples, this is not easily discoverable and sometimes people resorted to enumerating the keyspace's tables to be able to invoke table::uses_tablets().	2024-02-15 01:51:26 -05:00
Kamil Braun	7e9e10186f	Merge 'change the way ignored nodes are handled by the topology coordinator' from Gleb This series makes several changes to how ignored nodes list is treated by the topology coordinator. First the series makes it global and not part of a single topology operation, second it extends the list at the time of removenode/replace invocation and third it bans all nodes in the list from contacting the cluster ever again. The main motivation is to have a way to unblock tablet migration in case of a node failure. Tablet migration knows how to avoid nodes in ignored nodes list and this patch series provides a way to extend it without performing any topology operation (which is not possible while tables migration runs). Fixes scylladb/scylladb#16108 * 'gleb/ignore-nodes-handling-v2' of github.com:scylladb/scylla-dev: test: add test for the new ignore nodes behaviour topology coordinator: cleanup node_state::decommissioning state handling code topology coordinator: ban ignored nodes just like we ban nodes that are left storage_service: topology coordinator: validate ignore dead nodes parameters in removenode/replace topology coordinator: add removed/replaced nodes to ignored_nodes list at the request invocation time topology coordinator: make ignored_nodes list global and permanent topology_coordinator: do not cancel rebuild just because some other nodes are dead topology coordinator: throw more specific error from wait_for_ip() function in case of a timeout raft_group0: add make_nonvoters function that can make multiple node non voters simultaneously	2024-02-14 16:36:01 +01:00
Gleb Natapov	9b52dc4560	topology coordinator: make ignored_nodes list global and permanent Currently ignored_nodes list is part of a request (removenode or replace) and exists only while a request is handled. This patch changes it to be global and exist outside of any request. Node stays in the list until they eventually removed and moved to the "left" state. If a node is specified in the ignore-dead-nodes option for any command it will be ignored for all other operations that support ignored_nodes (like tablet migration).	2024-02-13 16:15:35 +02:00

1 2 3 4 5 ...

3645 Commits