scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-18 13:52:07 +00:00

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	84d06a13c7	api: compaction: add `consider_only_existing_data` option Added a new parameter `consider_only_existing_data` to major compaction API endpoints. When enabled, major compaction will: - Force-flush all tables. - Force a new active segment in the commit log. - Compact all existing SSTables and garbage-collect tombstones by only checking the SSTables being compacted. Memtables, commit logs, and other SSTables not part of the compaction will not be checked, as they will only contain newer data that arrived after the compaction started. The `consider_only_existing_data` is passed down to the compaction descriptor's `gc_check_only_compacting_sstables` option to ensure that only the existing data is considered for garbage collection. The option is also passed to the `maybe_flush_commitlog` method to make sure all the tables are flushed and a new active segment is created in the commit log. Fixes #19728 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-09-05 17:25:45 +05:30
Botond Dénes	9f9346fc59	Merge 'nodetool: tasks: add nodetool commands to track task manager tasks' from Aleksandra Martyniuk Add nodetool commands to manage task manager tasks: - tasks abort - aborts the task - tasks list - lists all tasks in the module - tasks modules - lists all modules - tasks set-ttl - sets task ttl - tasks status - gets status of the task - tasks tree - gets statuses of the task and all its desendent's - tasks ttl - gets task ttl - tasks wait - waits for the task and gets its status Fixes: https://github.com/scylladb/scylladb/issues/19201. Closes scylladb/scylladb#19614 * github.com:scylladb/scylladb: test: nodetool: add tests for tasks commands nodetool: tasks: add nodetool commands to track task manager tasks api: task_manager: return status 403 if a task is not abortable api: task_manager: return none instead of empty task id api: task_manager: add timeout to wait_task api: task_manager: add operation to get ttl nodetool: add suboperations support nodetool: change operations_with_func type nodetool: prepare operation related classes for suboperations	2024-08-30 07:37:37 +03:00
Aleksandra Martyniuk	627fc46ca7	api: task_manager: return status 403 if a task is not abortable	2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk	5bcff4d544	api: task_manager: add timeout to wait_task	2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk	3d78172328	api: task_manager: add operation to get ttl	2024-08-29 13:53:39 +02:00
Pavel Emelyanov	11a04bfb66	code: Introduce restore API method The method starts a task that uses sstables_loader load-and-stream functionality to bring new sstables into the cluster. The existing load-and-stream picks up sstables from upload/ directory, the newly introduced task collects them from S3 bucket and given prefix (that correspond to the path where backup API method put them). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-08-28 15:42:49 +03:00
Pavel Emelyanov	a812f13ddd	code: Introduce backup API method The method starts a task that uploads all files from the given keyspace's snapshot to the requested endpoint/bucket. The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion (hint: it's good to have non-zero TTL value to make sure fast backups don't finish before the caller manages to call wait_task API). If snapshot doesn't exist, nothing happens (FIXME, need to return back an error in that case). If endpoint is not configured locally, the API call resolves with bad-request instantly. Sstables components are scanned for all tables in the keyspace and are uploaded into the /bucket/${cf_name}/${snapshot_name}/ path. Task is not abortable (FIXME -- to be added) and doesn't really report its progress other than running/done state (FIXME -- to be added too). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-08-22 19:47:06 +03:00
Benny Halevy	0419b1d522	nodetool: rebuild: add force option To be used to force usage of source_dc, even when it is unsafe for rebuild. Update docs and add test/nodetool/test_rebuild.py Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:20:12 +03:00
Michał Jadwiszczak	870bdaa6b1	api/cql_server_test: add CQL server testing API Add a CQL server testing API with and endpoint to dump service level parameters of all CQL connections. This endpoint will be later used to test functionality of automated updating CQL connections parameters.	2024-08-08 10:42:09 +02:00
Aleksandra Martyniuk	5f7f403a15	tasks: api: add virtual task support Virtual tasks are supported by get_task_status, abort_task and wait_task. Task status returned by get_task_status and wait_task: - contains task_kind to indicate whether it's virtual (cluster) or regular (node) task; - children list apart from task_id contains node address of the task.	2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk	20ba7ceff9	tasks: api: add virtual tasks support to get_tasks task_manager/list_module_tasks/{module} starts supporting virtual tasks, which means that their stats will also be shown for users. Additional task_kind param is added to indicate whether the task is virutal (cluster-wide) or regular (node-wide). Support in other paths will be added in following patches.	2024-07-23 13:35:01 +02:00
Łukasz Paszkowski	781eb7517c	api/system: add highest_supported_sstable_format path Current upgrade dtest rely on a ccm node function to get_highest_supported_sstable_version() that looks for r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files. Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default and there is no cluster feature for it. Thus get_highest_supported_sstable_version() returns an empty list resulting in the upgrade tests failures. This change introduces a seperate API path that returns the highest supported sstable format (one of la, mc, md, me) by a scylla node. Fixes scylladb/scylladb#19772 Backports to 6.0 and 6.1 required. The current upgrade test in dtest checks scylla upgrades up to version 5.4 only. This patch is a prerequisite to backport the upgrade tests fix in dtest. Closes scylladb/scylladb#19787	2024-07-21 17:00:19 +03:00
Emil Maskovsky	a9e985fcc9	raft: add the read barrier REST API This will allow to trigger the read barrier directly via the API, instead of doing work-arounds (like dropping a non-existent table). The intended use-case is in the Scylla Manager, to make sure that the database schema is up to date after the data has been backed up and before attempting to backup the database schema. The database schema in particular is being backed up just on a single node, which might not yet have the schema at least as new as the data (data can be migrated to a newer schema, but not a vice-versa). The read barrier issued on the node should ensure that the node should have the schema at least as new as the data or newer. Closes #19213	2024-07-08 18:16:27 +02:00
Nadav Har'El	4faceeaa33	Merge 'treewide: drop thrift support' from Kefu Chai thrift support was deprecated since ScyllaDB 5.2 > Thrift API - legacy ScyllaDB (and Apache Cassandra) API is > deprecated and will be removed in followup release. Thrift has > been disabled by default. so let's drop it. in this change, * thrift protocol support is dropped * all references to thrift support in document are dropped * the "thrift_version" column in system.local table is preserved for backward compatibility, as we could load from an existing system.local table which still contains this clolumn, so we need to write this column as well. * "/storage_service/rpc_server" is only preserved for backward compatibility with java-based nodetool. Fixes #3811 Fixes #18416 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> - [x] not a fix, no need to backport Closes scylladb/scylladb#18453 * github.com:scylladb/scylladb: config: expand on rpc_keepalive's description api: s/rpc/thrift/ db/system_keyspace: drop thrift_version from system.local table transport: do not return client_type from cql_server::connection::make_client_key() treewide: drop thrift support	2024-06-17 22:36:49 +03:00
Botond Dénes	feea609e37	api/error_injection: add getter for error_injection Allow external code to obtain information about an error injection point, including whether it is enabled, and importantly, what its parameters are. Together with the `set_parameter()` added in the previous patch, this allows tests to read out the values of internal parameters, via a set_parameter() injection point.	2024-06-11 04:17:48 -04:00
Kefu Chai	c75442bc2a	api: s/rpc/thrift/ replace all occurrences of "rpc" in function names and debugging messages to "thrift", as "rpc" is way too general, and since we are removing "thrift" support, let's take this opportunity to use a more specific name. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-06-07 09:23:10 +08:00
Kefu Chai	ad649be1bf	treewide: drop thrift support thrift support was deprecated since ScyllaDB 5.2 > Thrift API - legacy ScyllaDB (and Apache Cassandra) API is > deprecated and will be removed in followup release. Thrift has > been disabled by default. so let's drop it. in this change, * thrift protocol support is dropped * all references to thrift support in document are dropped * the "thrift_version" column in system.local table is preserved for backward compatibility, as we could load from an existing system.local table which still contains this clolumn, so we need to write this column as well. * "/storage_service/rpc_server" is only preserved for backward compatibility with java-based nodetool. * `rpc_port` and `start_rpc` options are preserved, but they are marked as "Unused". so that the new release of scylladb can consume existing scylla.yaml configurations which might contain these settings. by making them deprecated, user will be able get warned, and update their configurations before we actually remove them in the next major release. Fixes #3811 Fixes #18416 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-06-07 06:44:59 +08:00
Kefu Chai	e70b116333	api/api-doc/utils: fix a typo in description s/mintues/minutes/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18869	2024-05-27 14:15:23 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Tomasz Grabiec	7956a2991e	api, storage_service: Introduce API to wait for topology to quiesce	2024-05-16 00:28:47 +02:00
Botond Dénes	fd25bb6f9f	api/storage_service: add tablet support for /storage_service/tokens_endpoint Add a keyspace and cf parameter. When specified, the endpoint will return token -> primary replica mapping for the table's tablet tokens, not the vnodes.	2024-05-13 07:09:20 -04:00
Pavel Emelyanov	8bad828208	api: Add method to delete replica from tablet Copied from the add_replica counterpart TODO: Generalize common parts of move_tablet and add_\|del_tablet_replica Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-15 16:31:07 +03:00
Pavel Emelyanov	79ad760e95	api: Add method to add replica to a tablet The new API submits rebuild transition with new replicas set to be old (current) replicas plus the provided one. It looks and acts like the move_tablet API call with several changes: - lacks the "source" replica argument - submits "rebuild" transition kind - cross racks checks are not performed The 'force' argument is inherited from move_tablet, but is unused now and is left for future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 09:22:16 +03:00
Avi Kivity	72bbe75d5b	Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes https://github.com/scylladb/scylladb/issues/17025 Nodes in the left state will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those node's location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacanter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. Currently left nodes are never removed from topology, so will accumulate in memory. We could garbage-collect them from topology coordinator if a left node is absent in any replica set. That means we need a new state - left_for_real. Closes scylladb/scylladb#17388 * github.com:scylladb/scylladb: test: py: Add test for view replica pairing after replace raft, api: Add RESTful API to query current leader of a raft group test: test_tablets_removenode: Verify replacing when there is no spare node doc: topology-on-raft: Document replace behavior with tablets tablets, raft topology: Rebuild tablets after replacing node is normal tablets: load_balancer: Access node attributes via node struct tablets: load_balancer: Extract ensure_node() mv: Switch to using host_id-based replica set effective_replication_map: Introduce host_id-based get_replicas() raft topology: Keep nodes in the left state to topology tablets: Introduce read_required_hosts()	2024-03-18 16:16:08 +02:00
Tomasz Grabiec	6d50e93f10	raft, api: Add RESTful API to query current leader of a raft group Example: $ curl -X GET "http://127.0.0.1:10000/raft/leader_host" "f7f57588-62de-4cac-9e4b-c62bfc458d91" Accepts optional group_id param, defaults to group0.	2024-03-15 13:20:08 +01:00
Benny Halevy	530d270828	api: /storage_service/tablets/balancing: fix incorrect operation summary It was probably copy-pasted from /storage_service/tablets/move Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#17811	2024-03-14 22:52:57 +01:00
Patryk Wrobel	51da80da7d	storage_service/ownership/{keyspace}: handle requests when tablets are enabled Before this change, when user tried to utilize 'storage_service/ownership/{keyspace}' API with keyspace parameter that uses tablets, then internal error was thrown. The code was calling a function, that is intended for vnodes: get_vnode_effective_replication_map(). This commit introduces graceful handling of such scenario and extends the API to allow passing 'cf' parameter that denotes table name. Now, when keyspace uses tablets and cf parameter is not passed a descriptive error message is returned via BAD_REQUEST. Users cannot query ownership for keyspace that uses tablets, but they can query ownership for a table in a given keyspace that uses tablets. Also, new tests have been added to test/rest_api/test_storage_service.py and to test/topology_experimental_raft/test_tablets.py in order to verify the behavior with and without tablets enabled. Refs: scylladb#17342 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>	2024-03-11 09:52:23 +01:00
Botond Dénes	050c6dcad7	api: storage_service/keyspaces: add replication filter To allow to filter the returned keyspaces based by the replication they use: tablets or vnodes. The filter can be disabled by omitting the parameter or passing "all". The default is "all". Fixes: #16509 Closes scylladb/scylladb#17319	2024-02-20 09:04:41 +01:00
Patryk Wrobel	3842bf18a7	storage_service/range_to_endpoint_map: allow API to properly handle tablets This API endpoint was failing when tablets were enabled because of usage of get_vnode_effective_replication_map(). Moreover, it was providing an error message that was not user-friendly. This change extends the handler to properly service the incoming requests. Furthermore, it introduces two new test cases that verify the behavior of storage_service/range_to_endpoint_map API. It also adjusts the test case of this endpoint for vnodes to succeed when tablets are enabled by default. The new logic is as follows: - when tablets are disabled then users may query endpoints for a keyspace or for a given table in a keyspace - when tablets are enabled then users have to provide table name, because effective replication map is per-table When user does not provide table name when tablets are enabled for a given keyspace, then BAD_REQUEST is returned with a meaningful error message. Fixes: scylladb#17343 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com> Closes scylladb/scylladb#17372	2024-02-18 19:21:53 +02:00
Kamil Braun	e9e24f47ec	Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski This PR implements a procedure that upgrades existing clusters to use raft-based topology operations. The procedure does not start automatically, it must be triggered manually by the administrator after making sure that no topology operations are currently running. Upgrade is triggered by sending `POST /storage_service/raft_topology/upgrade` request. This causes the topology coordinator to start who drives the rest of the process: it builds the `system.topology` state based on information observed in gossip and tells all nodes to switch to raft mode. Then, topology coordinator runs normally. Upgrade progress is tracked in a new static column `upgrade_state` in `system.topology`. The procedure also serves as an extension to the current recovery procedure on raft. The current recovery procedure requires restarting nodes in a special mode which disables raft, perform `nodetool removenode` on the dead nodes, clean up some state on the nodes and restart them so that they automatically rebuild the group 0. Raft topology fits into existing procedure by falling back to legacy topology operations after disabling raft. After rebuilding the group 0, upgrade needs to be triggered again. Because upgrade is manual and it might not be convenient for administrators to run it right after upgrading the cluster, we allow the cluster to operate in legacy topology operations mode until upgrade, which includes allowing new nodes to join. In order to allow it, nodes now ask the cluster about the mode they should use to join before proceeding by using a new `JOIN_NODE_QUERY` RPC. The procedure is explained in more detail in `topology-over-raft.md`. Fixes: https://github.com/scylladb/scylladb/issues/15008 Closes scylladb/scylladb#17077 * github.com:scylladb/scylladb: test/topology_custom: upgrade/recovery tests for topology on raft cdc/generation_service: in legacy mode, fall back to raft tables system_keyspace: add read_cdc_generation_opt cdc/generation_service: turn off gossip notifications in raft topo mode cql_test_env: move raft_topology_change_enabled var earlier group0_state_machine: pull snapshot after raft topology feature enabled storage_service: disable persistent feature enabler on upgrade storage_service: replicate raft features to system.peers storage_service: gossip tokens and cdc generation in raft topology mode API: add api for triggering and monitoring topology-on-raft upgrade storage_service: infer which topology operations to use on startup storage_service: set the topology kind value based on group 0 state raft_group0: expose link to the upgrade doc in the header feature_service: fall back to checking legacy features on startup storage_service: add fiber for tracking the topology upgrade progress gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES topology_coordinator: implement core upgrade logic topology_coordinator: extract top-level error handling logic storage_service: initialize discovery leader's state earlier topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data topology_state_machine: introduce upgrade_state storage_service: disallow topology ops when upgrade is in progress raft_group0_client: add in_recovery method storage_service: introduce join_node_query verb raft_group0: make discover_group0 public raft_group0: filter current node's IP in discover_group0 raft_group0: remove my_id arg from discover_group0 storage_service: make _raft_topology_change_enabled more advanced docs: document raft topology upgrade and recovery	2024-02-09 11:54:53 +01:00
Kefu Chai	c1c96bbc16	api/storage_service: drop /storage_service/describe_ring/ API per its description, "`/storage_service/describe_ring/`" returns the token ranges of an arbitrary keyspace. actually, it returns the first keyspace which is of non-local-vnode-based-strategy. this API is not used by nodetool, neither is it exercised in dtest. scylla-manager has a wrapper for this API though, but that wrapper is not used anywhere. in this change, this API is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17197	2024-02-09 12:49:21 +02:00
Piotr Dulikowski	a672383c2a	API: add api for triggering and monitoring topology-on-raft upgrade Implements the /storage_service/raft_topology/upgrade route. The route supports two methods: POST, which triggers the cluster-wide upgrade to topology-on-raft, and GET which reports the status of the upgrade.	2024-02-08 19:12:28 +01:00
Botond Dénes	35da9551fb	Merge 'storage_service: Add describe_ring support for tablet table' from Asias He The table query param is added to get the describe_ring result for a given table. Both vnode table and tablet table can use this table param, so it is easier for users to user. If the table param is not provided by user and the keyspace contains tablet table, the request will be rejected. E.g., curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles" curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1" Refs #16509 Closes scylladb/scylladb#17118 * github.com:scylladb/scylladb: tablets: Convert to use the new version of for_each_tablet storage_service: Add describe_ring support for tablet table storage_service: Mark host2ip as const tablets: Add for_each_tablet_gently	2024-02-07 10:41:36 +02:00
Asias He	04773bd1df	storage_service: Add describe_ring support for tablet table The table query param is added to get the describe_ring result for a given table. Both vnode table and tablet table can use this table param, so it is easier for users to user. If the table param is not provided by user and the keyspace contains tablet table, the request will be rejected. E.g., curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles" curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1" Refs #16509	2024-02-05 18:11:07 +08:00
Aleksandra Martyniuk	89c683f51a	api: service: add force param to move_tablet api Force flag is added to /storage_service/tablets/move. If force is set to true, replication strategy constraints regarding racks and dcs can be broken.	2024-02-02 19:08:01 +01:00
Botond Dénes	1a0300dba6	Merge 'compaction_manager: flush tables before cleanup' from Kefu Chai according to the document "nodetool cleanup" > Triggers removal of data that the node no longer owns currently, scylla performs cleanup by rewriting the sstables. but commitlog segments may still contain the mutations to the tables which are dropped during sstable rewriting. when scylla server restarts, the dirty mutations are replayed to the memtable. if any of these dirty mutations changes the tables cleaned up. the stale data are reapplied. this would lead to data resurrection. so, in this change we following the same model of major compaction where we 1. forcing new active segment, 2. flushing tables being cleaned up 3. perform cleanup using compaction Fixes #4734 Closes scylladb/scylladb#16757 * github.com:scylladb/scylladb: storage_service: fall back to local cleanup in cleanup_all compaction: format flush_mode without the helper compaction_manager: flush all tables before cleanup replica: table: pass do_flush to table::perform_cleanup_compaction() api, compaction: promote flush_mode	2024-02-01 13:47:45 +02:00
Kefu Chai	4ec104e086	api: storage_service: correct a typo s/a any keyspace/a given keyspace/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17098	2024-02-01 10:55:58 +02:00
Kefu Chai	5e0b3671d3	storage_service: fall back to local cleanup in cleanup_all before this change, if no keyspaces are specified, scylla-nodetool just enumerate all non-local keyspaces, and call "/storage_service/keyspace_cleanup" on them one after another. this is not quite efficient, as each this RESTful API call force a new active commitlog segment, and flushes all tables. so, if the target node of this command has N non-local keyspaces, it would repeat the steps above for N times. this is not necessary. and after a topology change, we would like to run a global "nodetool cleanup" without specifying the keyspace, so this is a typical use case which we do care about. to address this performance issue, in this change, we improve an existing RESTful API call "/storage_service/cleanup_all", so if the topology coordinator is not enabled, we fall back to a local cleanup to cleanup all non-local keyspaces. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-01 11:25:53 +08:00
Botond Dénes	d202d32f81	Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb/scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (`ff386e7a44`) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) Closes scylladb/scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call	2024-01-29 15:06:04 +02:00
Kamil Braun	617e09137d	api: add HTTP endpoint to trigger Raft snapshots This uses the `trigger_snapshot()` API added in previous commit on a server running for the given Raft group. It can be used for example in tests or in the context of disaster recovery (ref scylladb/scylladb#16683).	2024-01-23 16:48:28 +01:00
Kefu Chai	0dbb0ed09f	api: storage_service: correct a typo s/trough/through/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16870	2024-01-19 10:21:41 +02:00
Gleb Natapov	97ab3f6622	storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator Introduce new REST API "/storage_service/cleanup_all" that, when triggered, instructs the topology coordinator to initiate cluster wide cleanup on all dirty nodes. It is done by introducing new global command "global_topology_request::cleanup".	2024-01-14 15:45:53 +02:00
Kefu Chai	8c4576f55d	api: storage_service: correct the descriptions of two APIs this change is more about documentation of the RESTful API of storage_service. as we define the API using Swagger 2.0 format, and generate the API document from the definitions. so would be great if the document matches with the API. in this change, since the keyspace is not queried but mutated. so changed to a more accurate description. from the code perspective, it is but cosmetic. as we don't read the description fields or verify them in our tests. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16637	2024-01-11 08:28:14 +02:00
Eliran Sinvani	4c60804c4c	rest api: Add an api for profile dumping As part of code coverage support we need to work with dumped profiles for ScyllaDB executables. Those profiles are created on two occasions: 1. When an application exits notmaly (which will trigger __llvm_dump_profile registered in the exit hooks. 2. For ScyllaDB commit `d7b524cf10` introduced a manual call to __llvm_dump_profile upon receiving a SIGTERM signal. This commit adds a third option, a rest API to dump the profile. In addition the target file is logged and the counters are reset, which enables incremental dumping of the profile. Except for logging, if the executable is not instrumented, this API call becomes a no-op so it bears minimal risk in keeping it in our releases. Specifically for code coverage, the gain will be that we will not be required to change the entire test run to shut down clusters gracefully and this will cause minimal effect to the actual test behavior. The change was tested by manually triggering the API in with and without instrumentation as well as re triggering it with write permissions for the profile file disabled (to test fault tolerance). Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-12-27 07:06:54 +02:00
Aleksandra Martyniuk	b485897704	api: tasks: api for starting async compaction For all compaction types which can be started with api, add an asynchronous version of api, which returns task_id of the corresponding task manager task. With the task_id a user can check task status, abort, or wait for it, using task manager api.	2023-12-11 11:39:33 +01:00
Avi Kivity	9c0f05efa1	Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later. This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted. The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained. The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was. This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas. Closes scylladb/scylladb#15847 * github.com:scylladb/scylladb: test: tablets: Add test for failed streaming being fenced away error_injection: Introduce poll_for_message() error_injection: Make is_enabled() public api: Add API to kill connection to a particular host range_streamer: Do not block topology change barriers around streaming range_streamer, tablets: Do not keep token metadata around streaming tablets: Fail gracefully when migrating tablet has no pending replica storage_service, api: Add API to disable tablet balancing storage_service, api: Add API to migrate a tablet storage_service, raft topology: Run streaming under session topology guard storage_service, tablets: Use session to guard tablet streaming tablets: Add per-tablet session id field to tablet metadata service: range_streamer: Propagate topology_guard to receivers streaming: Always close the rpc::sink storage_service: Introduce concept of a topology_guard storage_service: Introduce session concept tablets: Fix topology_metadata_guard holding on to the old erm docs: Document the topology_guard mechanism	2023-12-07 16:29:02 +02:00
Tomasz Grabiec	733eb21601	api: Add API to kill connection to a particular host For testing failure scenarios.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	d1c1b59236	storage_service, api: Add API to disable tablet balancing Load balancing needs to be disabled before making a series of manual migrations so that we don't fight with the load balancer. Also will be used in tests to ensure tablets stick to expected locations.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	1f57d1ea28	storage_service, api: Add API to migrate a tablet Will be used in tests, or for hot fixes in production.	2023-12-06 18:36:17 +01:00
Yaniv Kaul	ae2ab6000a	Typos: fix typos in code Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors. Refs: https://github.com/scylladb/scylladb/issues/16255	2023-12-05 15:18:11 +02:00

1 2 3 4 5

221 Commits