scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 10:41:12 +00:00

Author	SHA1	Message	Date
Botond Dénes	562087beff	Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai" This reverts commit `d1dc579062`, reversing changes made to `3a73048bc9`. Said commit caused regressions in dtests. We need to investigate and fix those, but in the meanwhile let's revert this to reduce the disruption to our workflows. Refs: #14283	2023-06-19 08:49:27 +03:00
Kamil Braun	028183c793	main, cql_test_env: simplify `system_keyspace` initialization Initialization of `system_keyspace` is now all done at once instead of being spread out through the entire procedure. This is doable because `query_processor` is now available early. A couple of FIXMEs have been resolved.	2023-06-18 13:39:27 +02:00
Kamil Braun	33c19baabc	db: system_keyspace: take simpler service references in `make` Take references to services which are initialized earlier. The references to `gossiper`, `storage_service` and `raft_group0_registry` are no longer needed. This will allow us to move the `make` step right after starting `system_keyspace`.	2023-06-18 13:39:27 +02:00
Kamil Braun	b34605d161	db: system_keyspace: call `initialize_virtual_tables` from `main` `initialize_virtual_tables` was called from `system_keyspace::make`, which caused this `make` function to take a bunch of references to late-initialized services (`gossiper`, `storage_service`). Call it from `main`/`cql_test_env` instead. Note: `system_keyspace::make` is called from `distributed_loader::init_system_keyspace`. The latter function contains additional steps: populate the system keyspaces (with data from sstables) and mark their tables ready for writes. None of these steps apply to virtual tables. There exists at least one writable virtual table, but writes into virtual tables are special and the implementation of writes is virtual-table specific. The existing writable virtual table (`db_config_table`) only updates in-memory state when written to. If a virtual table would like to create sstables, or populate itself with sstable data on startup, it will have to handle this in its own initialization function. Separating `initialize_virtual_tables` like this will allow us to simplify `system_keyspace` initialization, making it independent of services used for distributed communication.	2023-06-18 13:39:27 +02:00
Kamil Braun	c931d9327d	db: system_keyspace: refactor virtual tables creation Split `system_keyspace::make` into two steps: creating regular `system` and `system_schema` tables, then creating virtual tables. This will allow, in later commit, to make `system_keyspace` initialization independent of services used for distributed communication such as `gossiper`. See further commits for details.	2023-06-18 13:39:27 +02:00
Kamil Braun	035045c288	db: system_keyspace: remove `system_keyspace_make` The code can now be inlined in `system_keyspace::make` as we no longer access private members of `database`.	2023-06-18 13:39:27 +02:00
Kamil Braun	cf120e46b8	db: system_keyspace: refactor local system table creation code `system_keyspace_make` would access private fields of `database` in order to create local system tables (creating the `keyspace` and `table` in-memory structures, creating directory for `system` and `system_schema`). Extract this part into `database::create_local_system_table`. Make `database::add_column_family` private.	2023-06-18 13:39:27 +02:00
Kamil Braun	3f04a5956c	replica: database: remove `is_bootstrap` argument from create_keyspace Unused.	2023-06-18 13:39:27 +02:00
Kamil Braun	53cf646103	db: system_keyspace: don't take `sharded<>` references Take `query_processor` and `database` references directly, not through `sharded<...>&`. This is now possible because we moved `query_processor` and `database` construction early, so by the time `system_keyspace` is started, the services it depends on were also already started. Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member functions can now be replaced with direct uses of `_qp` and `_db`. Runtime assertions for dependant services being initialized are gone.	2023-06-18 13:39:26 +02:00
Tomasz Grabiec	e41ff4604d	Merge 'raft_topology: fencing and global_token_metadata_barrier' from Gusev Petr This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit). * the topology version (int64) was introduced, it's stored in topology table and updated through RAFT at the relevant stages of the topology change algorithm; * when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster, if some node is unavailable we fail and retry indefinitely; * the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; if this round of RPCs is finished the topology change coordinator can be sure that there are no requests inflight using previous versions and such requests can't appear in the future. * after `barrier_and_drain` the topology change coordinator issues the `fence` command, it stores the current version in local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with older version was started before the fence, its reply will also be fenced. * the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from the nodes which was unavailable during topology change and which was not reached by the `barrier_and_drain()` command; * currently, fencing is implemented for `mutation` and `read` RPCs, other RPCs will be handled in the follow-ups; since currently all nodes are supposed to be alive the missing parts of the fencing doesn't break correctness; * along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling, these will be also added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done; * we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present. Closes #13884 * github.com:scylladb/scylladb: storage_service: warn if can't find ip for server storage_proxy.cc: add and use global_token_metadata_barrier storage_service: exec_global_command: bool result -> exceptions raft_topology: add cmd_index to raft commands storage_proxy.cc: add fencing to read RPCs storage_proxy.cc: extract handle_read storage_proxy.cc: refactor encode_replica_exception_for_rpc storage_proxy: fix indentation storage_proxy: add fencing for mutation storage_servie: fix indentation storage_proxy: add fencing_token and related infrastructure raft topology: add fence_version raft_topology: add barrier_and_drain cmd token_metadata: add topology version	2023-06-16 12:07:31 +02:00
Petr Gusev	f6b019c229	raft topology: add fence_version It's stored outside of topology table, since it's updated not through RAFT, but with a new 'fence' raft command. The current value is cached in shared_token_metadata. An initial fence version is loaded in main during storage_service initialisation.	2023-06-15 15:48:00 +04:00
Petr Gusev	253d8a8c65	token_metadata: add topology version It's stored in as a static column in topology table, will be updated at various steps of the topology change state machine. The initial value is 1, zero means that topology versions are not yet supported, will be used in RPC handling.	2023-06-15 15:48:00 +04:00
Kefu Chai	4c2df04449	db: config: add uuid_sstable_identifiers_enabled option unlike Cassandra 4.1, this option is true by default, will be used for enabling cluster feature of "UUID_SSTABLE_IDENTIFIERS". not wired yet. please note, because we are still using sstableloader and sstabledump based on 3.x branch, while the Cassandra upstream introduced the uuid sstable identifier in its 4.x branch, these tool fail to work with the sstables with uuid identifier, so this option is disabled when performing these tests. we will enable it once these tools are updated to support the uuid-basd sstable identifiers. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Kefu Chai	15543464ce	sstables, replica: support UUID in generation_type this change generalize the value of generation_type so it also supports UUID based identifier. * sstables/generation_type.h: - add formatter and parse for UUID. please note, Cassandra uses a different format for formatting the SSTable identifier. and this formatter suits our needs as it uses underscore "_" as the delimiter, as the file name of components uses dash "-" as the delimiter. instead of reinventing the formatting or just use another delimiter in the stringified UUID, we choose to use the Cassandra's formatting. - add accessors for accessing the type and value of generation_type - add constructor for constructing generation_type with UUID and string. - use hash for placing sstables with uuid identifiers into shards for more uniformed distrbution of tables in shards. * replica/table.cc: - only update the generator if the given generation contains an integer * test/boost: - add a simple test to verify the generation_type is able to parse and format Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Kamil Braun	0e36377f56	db: consistency_level: remove overload of `filter_for_query` Not used anymore after the previous commit.	2023-06-14 11:41:36 +02:00
Pavel Emelyanov	c68c154fb6	code: Reduce tracing/hh fanout There are some headers that include tracing/.hh ones despite all they need is forward-declared trace_state_ptr Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14155	2023-06-07 19:19:22 +03:00
Nadav Har'El	5984db047d	Merge 'mv: forbid IS NOT NULL on columns outside the primary key' from Jan Ciołek statement_restrictions: forbid IS NOT NULL on columns outside the primary key IS NOT NULL is currently allowed only when creating materialized views. It's used to convey that the view will not include any rows that would make the view's primary key columns NULL. Generally materialized views allow to place restrictions on the primary key columns, but restrictions on the regular columns are forbidden. The exception was IS NOT NULL - it was allowed to write regular_col IS NOT NULL. The problem is that this restriction isn't respected, it's just silently ignored (see #10365). Supporting IS NOT NULL on regular columns seems to be as hard as supporting any other restrictions on regular columns. It would be a big effort, and there are some reasons why we don't support them. For now let's forbid such restrictions, it's better to fail than be wrong silently. Throwing a hard error would be a breaking change. To avoid breaking existing code the reaction to an invalid IS NOT NULL restrictions is controlled by the `strict_is_not_null_in_views` flag. This flag can have the following values: * `true` - strict checking. Having an `IS NOT NULL` restriction on a column that doesn't belong to the view's primary key causes an error to be thrown. * `warn` - allow invalid `IS NOT NULL` restrictions, but throw a warning. The invalid restrictions are silently ignored. * `false` - allow invalid `IS NOT NULL` restricitons, without any warnings or errors. The invalid restrictions are silently ignored. The default values for this flag are `warn` in `db::config` and `true` in scylla.yaml. This way the existing clusters will have `warn` by default, so they'll get a warning if they try to create such an invalid view. New clusters with fresh scylla.yaml will have the flag set to `true`, as scylla.yaml overwrites the default value in `db::config`. New clusters will throw a hard error for invalid views, but in older existing clusters it will just be a warning. This way we can maintain backwards compatibility, but still move forward by rejecting invalid queries on new clusters. Fixes: #10365 Closes #13013 * github.com:scylladb/scylladb: boost/restriction_test: test the strict_is_not_null_in_views flag docs/cql/mv: columns outside of view's primary key can't be restricted cql-pytest: enable test_is_not_null_forbidden_in_filter statement_restrictions: forbid IS NOT NULL on columns outside the primary key schema_altering_statement: return warnings from prepare_schema_mutations() db/config: add strict_is_not_null_in_views config option statement_restrictions: add get_not_null_columns() test: remove invalid IS NOT NULL restrictions from tests	2023-06-07 12:12:19 +03:00
Jan Ciolek	c67d65987e	db/config: add strict_is_not_null_in_views config option IS NOT NULL shouldn't be allowed on columns which are outside of the materialized view's primary key. It's currently allowed to create views with such restrictions, but they're silently ignored, it's a bug. In the following commits restricting regular columns with IS NOT NULL will be forbidden. This is a breaking change. Some users might have existing code that creates views with such restrictions, we don't want to break it. To deal with this a new feature flag is introduced: strict_is_not_null_in_views. By default it's set to `warn`. If a user tries to create a view with such invalid restrictions they will get a warning saying that this is invalid, but the query will still go through, it's just a warning. The default value in scylla.yaml will be `true`. This way new clusters will have strict enforcement enabled and they'll throw errors when the user tries to create such an invalid view, Old clusters without the flag present in scylla.yaml will have the flag set to warn, so they won't break on an update. There's also the option to set the flag to `false`. It's dangerous, as it silences information about a bug, but someone might want it to silence the warnings for a moment. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2023-06-07 01:48:39 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritants to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-autimatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicatble) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairry -- it needs to use task queues list for IO classes names and shares, but to detect it should it checks for the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Aarav Arora	a12d2d5f16	fix: keyspace spell Closes #14121	2023-06-04 13:48:43 +03:00
Konstantin Osipov	b39ca97919	consistent_cluster_management: make the default As per our roll out plan, make consistent_cluster_management (aka Raft for schema changes) the default going forward. It means all clusters which upgrade from the previous version and don't have `consistent_cluster_management` explicitly set in scylla.yaml will begin upgrading to Raft once all nodes in the cluster have moved to the new version. Fixes #13980 Closes #13984	2023-06-02 09:05:09 +02:00
Kamil Braun	8be69fc3a0	Merge 'Initialize group0 server on boot before allowing incoming requests' from Gleb The series includes mostly cleanups and one bug fix. The fix is for the race where messages that need to access group0 server are arriving before the server is initialized. * 'gleb/group0-sp-mm-race-v2' of github.com:scylladb/scylla-dev: service: raft: fix typo service: raft: split off setup_group0_if_exist from setup_group0 storage_service: do not allow override_decommission flag if consistent cluster management is enabled storage_service: fix indentation after the previous patch storage_service: co-routinize storage_service::join_cluster() function storage_service: do not reload topology from peers table if topology over raft is enabled storage_service: optimize debug logging code in case debug log is not enabled	2023-06-01 17:37:58 +02:00
Gleb Natapov	acc035b504	storage_service: do not allow override_decommission flag if consistent cluster management is enabled If consistent cluster management is enabled it is not possible to restart decommissioned node since it will not be part of the grouup0.	2023-05-31 10:40:42 +03:00
Avi Kivity	ffce6d94fc	Merge 'service: storage_proxy: make hint write handlers cancellable' from Kamil Braun The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See #3966, #4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Use this new functionality to also cancel hint write handlers when we shutdown. This fixes #8079. Closes #14047 * github.com:scylladb/scylladb: test: reproducer for hints manager shutdown hang test: pylib: ScyllaCluster: generalize config type for `server_add` test: pylib: scylla_cluster: add explicit timeout for graceful server stop service: storage_proxy: make hint write handlers cancellable service: storage_proxy: rename `view_update_handlers_list` service: storage_proxy: make it possible to cancel all write handler types	2023-05-30 01:36:50 +03:00
Kamil Braun	beabb61566	test: reproducer for hints manager shutdown hang	2023-05-29 11:03:39 +02:00
Kamil Braun	0ef35ceed4	service: storage_proxy: make hint write handlers cancellable Whether a write handler should be cancellable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes #8079	2023-05-29 11:03:18 +02:00
Pavel Emelyanov	5861d15912	Merge 'Small gossiper and migration_manager cleanups' from Gleb Some assorted cleanups here: consolidation of schema agreement waiting into a single place and removing unused code from the gossiper. CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/ Reviewed-by: Konstantin Osipov <kostja@scylladb.com> * gleb/gossiper-cleanups of github.com:scylladb/scylla-dev: storage_service: avoid unneeded copies in on_change storage_service: remove check that is always true storage_service: rename handle_state_removing to handle_state_removed storage_service: avoid string copy storage_service: delete code that handled REMOVING_TOKENS state gossiper: remove code related to advertising REMOVING_TOKEN state migration_manager: add wait_for_schema_agreement() function	2023-05-27 10:49:54 +03:00
Gleb Natapov	a429018a8a	migration_manager: add wait_for_schema_agreement() function Several subsystems re-implement the same logic for waiting for schema agreement. Provide the function in the migration_manager and use it instead.	2023-05-25 14:44:53 +03:00
Tomasz Grabiec	51e3b9321b	Merge ' mvcc: make schema upgrades gentle' from Michał Chojnowski After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is a one of the last vestiges of the times when partition were treated atomically, and it is a well known source of numerous large stalls. This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery. Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version. After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade. The series: 1. Does some code cleanup in the mutation_partition area. 2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry). 3. Adds upgrading variants of constructors and apply() for `row` and its wrappers. 4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains. 5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version. Fixes #2577 Closes #13761 * github.com:scylladb/scylladb: test: mvcc_test: add a test for gentle schema upgrades partition_version: make partition_entry::upgrade() gentle partition_version: handle multi-schema snapshots in merge_partition_versions mutation_partition_v2: handle schema upgrades in apply_monotonically() partition_version: remove the unused "from" argument in partition_entry::upgrade() row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades partition_version: handle multi-schema entries in partition_entry::squashed partition_snapshot_row_cursor: handle multi-schema snapshots partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots partition_version: add a logalloc::region argument to partition_entry::upgrade() memtable: propagate the region to memtable_entry::upgrade_schema() mutation_partition: add an upgrading variant of lazy_row::apply() mutation_partition: add an upgrading variant of rows_entry::rows_entry mutation_partition: switch an apply() call to apply_monotonically() mutation_partition: add an upgrading variant of rows_entry::apply_monotonically() mutation_fragment: add an upgrading variant of clustering_row::apply() mutation_partition: add an upgrading variant of row::row partition_version: remove _schema from partition_entry::operator<< partition_version: remove the schema argument from partition_entry::read() memtable: remove _schema from memtable_entry row_cache: remove _schema from cache_entry partition_version: remove the _schema field from partition_snapshot partition_version: add a _schema field to partition_version mutation_partition: change schema_ptr to schema& in mutation_partition::difference mutation_partition: change schema_ptr to schema& in mutation_partition constructor mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor mutation_partition: add upgrading variants of row::apply() partition_version: update the comment to apply_to_incomplete() mutation_partition_v2: clean up variants of apply() mutation_partition: remove apply_weak() mutation_partition_v2: remove a misleading comment in apply_monotonically() row_cache_test: add schema changes to test_concurrent_reads_and_eviction mutation_partition: fix mixed-schema apply()	2023-05-24 22:58:43 +02:00
Kefu Chai	b0c40a2a03	db: config: s/ingore/ignore/ this string is used in as the option description in the command line help message. so it is a part of user facing interface. in this change, the typo is fixed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14013	2023-05-24 13:35:24 +03:00
Pavel Emelyanov	5aea6938ae	commitlog: Introduce and use comitlog sched group Nowadays all commitlog code runs in whatever sched group it's kicked from. Since IO prio classes are going to be inherited from the current sched group the commitlog IO loops should be moved into commitlog sched group, not inherit a "random" one. There are currently two places that need correct context for IO -- the .cycle() method and segments replenisher. `$ perf-simple-query --write -c2` results --- Before the patch --- 194898.36 tps ( 56.3 allocs/op, 12.7 tasks/op, 54307 insns/op, 0 errors) 199286.23 tps ( 56.2 allocs/op, 12.7 tasks/op, 54375 insns/op, 0 errors) 199815.84 tps ( 56.2 allocs/op, 12.7 tasks/op, 54377 insns/op, 0 errors) 198260.98 tps ( 56.3 allocs/op, 12.7 tasks/op, 54380 insns/op, 0 errors) 198572.86 tps ( 56.2 allocs/op, 12.7 tasks/op, 54371 insns/op, 0 errors) median 198572.86 tps ( 56.2 allocs/op, 12.7 tasks/op, 54371 insns/op, 0 errors) median absolute deviation: 713.36 maximum: 199815.84 minimum: 194898.36 --- After the patch --- 194751.80 tps ( 56.3 allocs/op, 12.7 tasks/op, 54331 insns/op, 0 errors) 199084.70 tps ( 56.2 allocs/op, 12.7 tasks/op, 54389 insns/op, 0 errors) 195551.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54385 insns/op, 0 errors) 197953.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54386 insns/op, 0 errors) 198710.00 tps ( 56.3 allocs/op, 12.7 tasks/op, 54387 insns/op, 0 errors) median 197953.47 tps ( 56.3 allocs/op, 12.7 tasks/op, 54386 insns/op, 0 errors) median absolute deviation: 1131.24 maximum: 199084.70 minimum: 194751.80 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14005	2023-05-23 21:25:57 +03:00
Tomasz Grabiec	809ddd7f79	Merge 'Move pending_ranges and endpoints_for_reading from token_metadata to erm' from Gusev Petr This refactoring is a follow-up for https://github.com/scylladb/scylladb/pull/13376, move per keyspace data structures related to topology changes from `token_metadata` to `erm`. We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`. At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring. The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`. We store endpoints list only once with special key instead of duplicating them for each `vnode` token. The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation. Closes #13715 * github.com:scylladb/scylladb: token_metadata_test: add a test for everywhere strategy token_metadata_test: check read_endpoints when bootstrapping first node token_metadata_test: refactor tests, extract create_erm token_metadata: drop has_pending_ranges and migration_info effective_replication_map: add has_pending_ranges token_metadata: drop update_pending_ranges effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading calculate_effective_replication_map: compute pending_endpoints and read_endpoints vnode_erm: optimize replication_map vnode_erm::get_range_addresses: use sorted_tokens abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token sequenced_set: add extract_vector method effective_replication_map: clone_endpoints_gently -> clone_data_gently vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints stall_free.hh: add clear_gently for rvalues stall_free.hh: relax Container requirement token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map token_metadata: introduce topology_change_info token_metadata: replace set_topology_transition_state with set_read_new	2023-05-22 21:37:06 +02:00
Tomasz Grabiec	9d4bca26cc	Merge 'raft topology: implement `check_and_repair_cdc_streams` API' from Kamil Braun `check_and_repair_cdc_streams` is an existing API which you can use when the current CDC generation is suboptimal, e.g. after you decommissioned a node the current generation has more stream IDs than you need. In that case you can do `nodetool checkAndRepairCdcStreams` to create a new generation with fewer streams. It also works when you change number of shards on some node. We don't automatically introduce a new generation in that case but you can use `checkAndRepairCdcStreams` to create a new generation with restored shard-colocation. This PR implements the API on top of raft topology, it was originally implemented using gossiper. It uses the `commit_cdc_generation` topology transition state and a new `publish_cdc_generation` state to create new CDC generations in a cluster without any nodes changing their `node_state`s in the process. Closes #13683 * github.com:scylladb/scylladb: docs: update topology-over-raft.md test: topology_experimental_raft: test `check_and_repair_cdc` API raft topology: implement `check_and_repair_cdc_streams` API raft topology: implement global request handling raft topology: introduce `prepare_new_cdc_generation_data` raft_topology: `get_node_to_work_on_opt`: return guard if no node found raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition raft topology: separate `publish_cdc_generation` state raft topology: non-node-specific `exec_global_command` raft topology: introduce `start_operation()` raft topology: non-node-specific `topology_mutation_builder` topology_state_machine: introduce `global_topology_request` topology_state_machine: use `uint16_t` for `enum_class`es raft topology: make `new_cdc_generation_data_uuid` topology-global	2023-05-22 11:33:58 +02:00
Petr Gusev	87307781c4	effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading We already use the new pending_endpoints from erm though the get_pending_ranges virtual function, in this commit we update all the remaining places to use the new implementation in erm, as well as remove the old implementation in token_metadata.	2023-05-21 13:17:42 +04:00
Kamil Braun	13df85ea11	Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov This implicit link it pretty bad, because feature service is a low-level one which lots of other services depend on. System keyspace is opposite -- a high-level one that needs e.g. query processor and database to operate. This inverse dependency is created by the feature service need to commit enabled features' names into system keyspace on cluster join. And it uses the qctx thing for that in a best-effort manner (not doing anything if it's null). The dependency can be cut. The only place when enabled features are committed is when gossiper enables features on join or by receiving state changes from other nodes. By that time the sharded<system_keyspace> is up and running and can be used. Despite gossiper already has system keyspace dependency, it's better not to overload it with the need to mess with enabling and persisting features. Instead, the feature_enabler instance is equipped with needed dependencies and takes care of it. Eventually the enabler is also moved to feature_service.cc where it naturally belongs. Fixes: #13837 Closes #13172 * github.com:scylladb/scylladb: gossiper: Remove features and sysks from gossiper system_keyspace: De-static save_local_supported_features() system_keyspace: De-static load_\|save_local_enabled_features() system_keyspace: Move enable_features_on_startup to feature_service (cont) system_keyspace: Move enable_features_on_startup to feature_service feature_service: Open-code persist_enabled_feature_info() into enabler gms: Move feature enabler to feature_service.cc gms: Move gossiper::enable_features() to feature_service::enable_features_on_join() gms: Persist features explicitly in features enabler feature_service: Make persist_enabled_feature_info() return a future system_keyspace: De-static load_peer_features() gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features() gossiper: Enable features and register enabler from outside gms: Add feature_service and system_keyspace to feature_enabler	2023-05-18 18:21:06 +02:00
Pavel Emelyanov	5216dcb1b3	Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`. This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups. This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR. Refs: #11870 Closes #13869 * github.com:scylladb/scylladb: db/system_keyspace: remove dependency on storage_proxy db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent replica: add query.hh	2023-05-18 10:53:27 +03:00
Pavel Emelyanov	29fffaa160	schema_tables: Use sharded<database>& variable The auto& db = proxy.local().get_db() is called few lines above this patch, so the &db can be reused for invoke_on_all() call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13896	2023-05-16 12:57:47 +03:00
Botond Dénes	0cff0ffa08	Merge 'alternator,config: make alternator_timeout_in_ms live-updateable' from Kefu Chai before this change, alternator_timeout_in_ms is not live-updatable, as after setting executor's default timeout right before creating sharded executor instances, they never get updated with this option anymore. but many users would like to set the driver timers based on server timers. we need to enable them to configure timeout even when the server is still running. in this change, * `alternator_timeout_in_ms` is marked as live-updateable * `executor::_s_default_timeout` is changed to a thread_local variable, so it can be updated by a per-shard updateable_value. and it is now a updateable_value, so its variable name is updated accordingly. this value is set in the ctor of executor, and it is disconnected from the corresponding named_value<> option in the dtor of executor. * alternator_timeout_in_ms is passed to the constructor of executor via sharded_parameter, so `executor::_timeout_in_ms` can be initialized on per-shard basis * `executor::set_default_timeout()` is dropped, as we already pass the option to executor in its ctor. Fixes #12232 Closes #13300 * github.com:scylladb/scylladb: alternator: split the param list of executor ctor into multi lines alternator,config: make alternator_timeout_in_ms live-updateable	2023-05-15 10:16:29 +03:00
Piotr Dulikowski	760651b4ad	error injection: allow enabling injections via config Currently, error injections can be enabled either through HTTP or CQL. While these mechanisms are effective for injecting errors after a node has already started, it can't be reliably used to trigger failures shortly after node start. In order to support this use case, this commit adds possibility to enable some error injections via config. A configuration option `error_injections_at_startup` is added. This option uses our existing configuration framework, so it is possible to supply it either via CLI or in the YAML configuration file. - When passed in commandline, the option is parsed as a semicolon-separated list of error injection names that should be enabled. Those error injections are enabled in non-oneshot mode. The CLI option is marked as not used in release mode and does not appear in the option list. Example: --error-injections-at-startup failure_point1;failure_point2 - When provided in YAML config, the option is parsed as a list of items. Each item is either a string or a map or parameters. This method is more flexible as it allows to provide parameters for each injection point. At this time, the only benefit is that it allows enabling points in oneshot mode, but more parameters can be added in the future if needed. Explanatory example: error_injections_at_startup: - failure_point1 # enabled in non-oneshot mode - name: failure_point2 # enabled in oneshot mode one_shot: true # due to one_shot optional parameter The primary goal of this feature is to facilitate testing of raft-based cluster features. An error injection will be used to enable an additional feature to simulate node upgrade. Tests: manual Closes #13861	2023-05-15 09:14:07 +03:00
Botond Dénes	157fdb2f6d	db/system_keyspace: remove dependency on storage_proxy The methods that take storage_proxy as argument can now accept a replica::database instead. So update their signatures and update all callers. With that, system_keyspace.* no longer depends on storage_proxy directly.	2023-05-12 07:27:55 -04:00
Botond Dénes	f4f757af23	db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent Use the recently introduced replica side query utility functions to query the content of the system tables. This allows us to cut the dependency of the system keyspace on storage proxy. The methods still take storage proxy parameter, this will be replaced with replica::database in the next patch. There is still one hidden storage proxy dependency left, via clq3::query_processor. This will be addressed later.	2023-05-12 07:27:55 -04:00
Wojciech Mitros	9ae1b02144	service: revoke permissions on functions when a function/keyspace is dropped Currently, when a user has permissions on a function/all functions in keyspace, and the function/keyspace is dropped, the user keeps the permissions. As a result, when a new function/keyspace is created with the same name (and signature), they will be able to use it even if no permissions on it are granted to them. Simliarly to regular UDFs, the same applies to UDAs. After this patch, the corresponding permissions on functions are dropped when a function/keyspace is dropped. Fixes #13820 Closes #13823	2023-05-10 14:39:42 +03:00
Petr Gusev	052b91fb1f	storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading We are going to use remapped_endpoints_for_reading, we need to make sure we use it in the right place. The get_live_sorted_endpoints function looks like what we need - it's used in all read code paths. From its name, however, this was not obvious. Also, we add the parameter ks_name as we'll need it to pass to remapped_endpoints_for_reading.	2023-05-09 18:42:03 +04:00
Kamil Braun	acfb6bf3ed	topology_state_machine: introduce `global_topology_request` `topology` currently contains the `requests` map, which is suitable for node-specific requests such as "this node wants to join" or "this node must be removed". But for requests for operations that affect the cluster as a whole, a separate request type and field is more appropriate. Introduce one. The enum currently contains the option `new_cdc_generation` for requests to create a new CDC generation in the cluster. We will implement the whole procedure in later commits.	2023-05-08 16:46:14 +02:00
Kamil Braun	93dcdcd4eb	raft topology: make `new_cdc_generation_data_uuid` topology-global - make it a static column in `system.topology` - move it from node-specific `ring_slice` to cluster-global `topology` We will use it in scenarios where no node is transitioning. Also make it `std::optional` in topology for consistency with other fields (previously, the 'no value' state for this field was represented using default-constructed `utils::UUID`).	2023-05-08 16:46:14 +02:00
Kefu Chai	5fa459bd1a	treewide: do not include unused header since #13452, we switched most of the caller sites from std::regex to boost::regex. in this change, all occurences of `#include <regex>` are dropped unless std::regex is used in the same source file. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13765	2023-05-07 19:01:29 +03:00
Avi Kivity	f125a3e315	Merge 'tree: finish the reader_permit state renames' from Botond Dénes In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however only covered only the states themselves and their usages, as well as the documentation in `docs/dev`. This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date. Closes #13573 * github.com:scylladb/scylladb: reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes reader_concurrency_semaphore: update API w.r.t. recent permit state name changes reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes	2023-05-04 18:29:04 +03:00
Avi Kivity	1d351dde06	Merge 'Make S3 client work with real S3' from Pavel Emelyanov Current S3 client was tested over minio and it takes few more touches to work with amazon S3. The main challenge here is to support singed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn is the essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements, one of them is -- request _body_ can be or can be not included into signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. -- request body should be included into signature calculations), which can a bit troublesome, so instead the PR uses unsigned payload (i.e. -- doesn't include the request body into signature calculation, only necessary headers and query parameters), but thus also needs HTTPS. So what this set does is makes the existing S3 client code sign requests. In order to sign the request the code needs to get AWS key and secret (and region) from somewhere and this somewhere is the conf/object_storage.yaml config file. The signature generating code was previously merged (moved from alternator code) and updated to suit S3 client needs. In order to properly support HTTPS the PR adds special connection factory to be used with seastar http client. The factory makes DNS resolving of AWS endpoint names and configures gnutls systemtrust. fixes: #13425 Closes #13493 * github.com:scylladb/scylladb: doc: Add a document describing how to configure S3 backend s3/test: Add ability to run boost test over real s3 s3/client: Sign requests if configured s3/client: Add connection factory with DNS resolve and configurable HTTPS s3/client: Keep server port on config s3/client: Construct it with config s3/client: Construct it with sstring endpoint sstables: Make s3_storage with endpoint config sstables_manager: Keep object storage configs onboard code: Introduce conf/object_storage.yaml configuration file	2023-05-04 18:08:54 +03:00
Michał Chojnowski	a70c5704df	mutation_partition: change schema_ptr to schema& in mutation_partition constructor Cosmetic change. See the preceding commit for details.	2023-05-04 02:37:29 +02:00
Pavel Emelyanov	2f6aa5b52e	code: Introduce conf/object_storage.yaml configuration file In order to access real S3 bucket, the client should use signed requests over https. Partially this is due to security considerations, partially this is unavoidable, because multipart-uploading is banned for unsigned requests on the S3. Also, signed requests over plain http require signing the payload as well, which is a bit troublesome, so it's better to stick to secure https and keep payload unsigned. To prepare signed requests the code needs to know three things: - aws key - aws secret - aws region name The latter could be derived from the endpoint URL, but it's simpler to configure it explicitly, all the more so there's an option to use S3 URLs without region name in them we could want to use some time. To keep the described configuration the proposed place is the object_storage.yaml file with the format endpoints: - name: a.b.c port: 443 aws_key: 12345 aws_secret: abcdefghijklmnop ... When loaded, the map gets into db::config and later will be propagated down to sstables code (see next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-03 20:19:15 +03:00

1 2 3 4 5 ...

3122 Commits