scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-01 13:45:53 +00:00

Author	SHA1	Message	Date
Nadav Har'El	dce47a81b0	alternator, tablets: return error if enabling Streams with tablets Alternator Streams doesn't yet work on tables using tablets (this is issue #16317). Before this patch, an attempt to enable it results in an unsightly InternalServerError, which isn't terrible - but we can do better. So in this patch, we make the attempt to enable Streams and tablets together into a clear error. The error message points to the open issue, and also suggests how to create a table that uses vnodes, not tablets. Unfortunately, there are two slightly different code paths and error messages for two cases: One case is the creation of a new table (where the validation happens before the keyspace is actually created), and the other case is an attempt to enable streams on an existing table with an existing keyspace (which already might or might not be using tablets). This patch also adds a test that verifies that trying to enable Streams with tablets is an error - in both cases (table creation and update). Obviously, this test - and the validation code - should be removed once the issue is solved and Alternator Streams begins working with tablets. Fixes #16497 Refs #16807 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17311	2024-02-13 16:42:35 +02:00
Raphael S. Carvalho	54226dddf5	replica: Kill vnode-oriented cleanup handling for multiple compaction groups With tablets, we don't use vnode-oriented sstable cleanup. So let's just remove unused code and bail out silently if sharding is tablet based. The reason for silence is that we don't want to break tests that might be reused for tablets, and it's not a problem for sstable cleanup to be ignored with tablets. This approach is actually already used in the higher level code, implementing the cleanup API. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17296	2024-02-13 16:35:15 +02:00
Petr Gusev	3722ca0a41	sync_raft_topology_nodes: parallelize system_keyspace update functions In sync_raft_topology_nodes we execute a system keyspace update query for each node of the cluster. The system keyspace tables use schema commitlog which by default enables use_o_dsync. This means that each write to the commitlog is accompanied by fsync. For large clusters this can incur hundreds of writes with fsyncs, which is very expensive. For example, in #17039 for a moderate size cluster of 50 nodes sync_raft_topology_nodes took almost 5 seconds. In this commit we solve this problem by running all such update queries in parallel. The commitlog should batch them and issue only one write syscall to the OS. Closes scylladb/scylladb#17243	2024-02-13 14:44:48 +01:00
Piotr Dulikowski	314fd9a11f	test: test_topology_recovery_basic: add missing driver reconnect Unfortunately, scylladb/python-driver#230 is not fixed yet, so it is necessary for the sake of our CI's stability to re-create the driver session after all nodes in the cluster are restarted. There is one place in test_topology_recovery_basic where all nodes are restarted but the driver session is not re-created. Even though nodes are not restarted at once but rather sequentially, we observed a failure with similar symptoms in a CI run for scylla-enterprise. Add the missing driver reconnect as a workaround for the issue. Fixes: scylladb/scylladb#17277 Closes scylladb/scylladb#17278	2024-02-13 12:28:30 +01:00
David Garcia	f45d9d33f1	docs: remove liveness asterisks Instead of adding an asterisk next to "liveness" linking to the glossary, we will temporarily replace them with a hyperlink pending the implementation of tooltip functionality. Closes scylladb/scylladb#17244	2024-02-12 20:37:52 +02:00
Avi Kivity	b22db74e6a	Regenerate frozen toolchain For gnutls 3.8.3 and clang clang-16.0.6-4. Fixes #17285. Closes scylladb/scylladb#17287	2024-02-12 18:36:11 +02:00
Botond Dénes	3f2d7e8b25	tree: remove unnecessary yields around for_each_tablet() Commit `904bafd069` consolidated the two existing for_each_tablet() overloads, to the one which has a future<> returning callback. It also added yields to the bodies of said callbacks. This is unnecessary, the loop in for_each_tablet() already has a yield per tablet, which should be enough to prevent stalls. This patch is a follow-up to #17118 Closes scylladb/scylladb#17284	2024-02-12 17:10:25 +01:00
Kamil Braun	2e81f045cc	Merge 'transport: controller: do_start_server: do not set_cql_read for maintenance port' from Benny Halevy RPC is not ready yet at this point, so we should not set this application state yet. Also, simplify add_local_application_state as it contains dead code that will never generate an internal error after `1d07a596bf`. Fixes #16932 Closes scylladb/scylladb#17263 * github.com:scylladb/scylladb: gossiper: add_local_application_state: drop internae error transport: controller: do_start_server: do not set_cql_read for maintenance port	2024-02-12 13:26:45 +01:00
Pavel Emelyanov	2b1612aa04	main: Stop lifecycle notifier for real It wasn't stopped because of storage service; now the latter is stopped (since `e6b34527c1`), so the former can be stopped too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17251	2024-02-12 13:59:50 +02:00
Kefu Chai	7baee379de	sstable/storage: pass fs::path to storage::create_links() this change is a follow-up of `637dd730`. the goal is to use std::filesystem::path for manipulating paths, and to avoid the converting between sstring and fs::path back and forth. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17257	2024-02-12 13:26:11 +02:00
Kefu Chai	7a5cb69e33	storage_service: s/format()/fmt::format/ in the same spirit of `e84a0991`, let's switch the callers who expect std::string to fmt::format(). to minimize the impact and to reduce the risk, the switch will be performed piecemeal. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17253	2024-02-12 13:24:21 +02:00
Pavel Emelyanov	b9721bd397	test/tablets: Decommissioning node below RF is not allowed When a node is decommissioned, all tablet replicas need to be moved away from it. In some cases it may not be possible. If the number of nodes in the cluster equals the keyspace RF, one cannot decommission any node because it's not possible to find nodes for every replica. The new test case validates this constraint is satisfied. refs: #16195 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17248	2024-02-12 13:21:47 +02:00
Nadav Har'El	21e7deafeb	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. 
The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172	2024-02-12 13:17:29 +02:00
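The timestamp fix described in the commit above can be illustrated with a deliberately simplified model. This is a sketch only, not Scylla's view-update code: the function names are hypothetical, and the tie-break rule (a write survives only if it is strictly newer than the tombstone) is an assumption chosen to match the commit's statement that a deletion wins when timestamps are equal.

```python
def old_view_timestamp(ts_a, ts_b):
    # Pre-fix behaviour as described in the commit message: always take
    # a's timestamp, assuming it equals b's.
    return ts_a


def new_view_timestamp(ts_a, ts_b):
    # Post-fix behaviour: use the higher of the two timestamps.
    return max(ts_a, ts_b)


def row_is_live(write_ts, tombstone_ts):
    # Assumed tie-break: a write survives only if strictly newer than
    # the tombstone, so equal timestamps resolve in favour of deletion.
    return write_ts > tombstone_ts


# 'a' was written once at ts=10. 'b' is changed at ts=20 (the old view
# row is deleted) and changed back at ts=30 (the view row is re-inserted).
TS_A = 10
old_delete = old_view_timestamp(TS_A, 20)
old_insert = old_view_timestamp(TS_A, 30)
new_delete = new_view_timestamp(TS_A, 20)
new_insert = new_view_timestamp(TS_A, 30)

print("before fix, row recreated:", row_is_live(old_insert, old_delete))
print("after fix, row recreated:", row_is_live(new_insert, new_delete))
```

In this model the pre-fix deletion and re-insertion both carry timestamp 10, so the row stays deleted. With the maximum of the two timestamps the re-insertion is newer than the deletion.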
Nadav Har'El	341af86167	test/cql-pytest: reproducer for GROUP BY regression This patch adds a simple reproducer for a regression in Scylla 5.4 caused by commit `432cb02`, breaking LIMIT support in GROUP BY. Refs #17237 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17275	2024-02-12 13:09:52 +02:00
Kefu Chai	57df20eef8	configure.py: use un-deprecated module PEP 632 deprecates distutils module, and it is removed from Python 3.12. we are actually using the one vendored by setuptools, if we are using 3.12. so let's use shutil for finding ninja executable. see https://peps.python.org/pep-0632/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17271	2024-02-12 13:05:35 +02:00
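As a minimal sketch of the approach named in the commit above, the standard-library `shutil.which()` can locate an executable on `PATH` without `distutils`. The helper name `find_ninja` and the candidate names are assumptions for illustration, not the actual contents of `configure.py`.

```python
import shutil


def find_ninja(candidates=("ninja", "ninja-build")):
    """Return the path of the first candidate found on PATH, or None.

    The candidate names are illustrative assumptions.
    """
    for name in candidates:
        path = shutil.which(name)  # returns None when the name is not on PATH
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    print(find_ninja() or "ninja executable not found")
```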
Kamil Braun	7d73c40125	Merge 'test.py: tablets: Fix flakiness of test_tablet_missing_data_repair' from Tomasz Grabiec Reimplements stop/start sequence using rolling_restart() which is safe with regards to UP status propagation and not prone to sudden connection drop which may cause later CQL queries to time out. It also ensures that CQL is up on all the remaining nodes when the with_down callback is executed. The test was observed to fail in CI like this: ``` cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.157.135.26:9042 datacenter1>: ConnectionException('Pool for 127.157.135.26:9042 is shutdown')}) ... @pytest.mark.repair @pytest.mark.asyncio async def test_tablet_missing_data_repair(manager: ManagerClient): ... for idx in range(0,3): s = servers[idx].server_id await manager.server_stop_gracefully(s, timeout=120) > await check() ``` Hopefully: Fixes #17107 Closes scylladb/scylladb#17252 * github.com:scylladb/scylladb: test: py: tablets: Fix flakiness of test_tablet_missing_data_repair test: pylib: manager_client: Wait for driver to catch up in rolling_restart() test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down	2024-02-12 11:52:09 +01:00
Botond Dénes	f068d1a6fa	query: do not kill unpaged queries when they reach the tombstone-limit The reason we introduced the tombstone-limit (query_tombstone_page_limit), was to allow paged queries to return incomplete/empty pages in the face of large tombstone spans. This works by cutting the page after the tombstone-limit amount of tombstones were processed. If the read is unpaged, it is killed instead. This was a mistake. First, it doesn't really make sense, the reason we introduced the tombstone limit, was to allow paged queries to process large tombstone-spans without timing out. It does not help unpaged queries. Furthermore, the tombstone-limit can kill internal queries done on behalf of user queries, because all our internal queries are unpaged. This can cause denial of service. So in this patch we disable the tombstone-limit for unpaged queries altogether, they are allowed to continue even after having processed the configured limit of tombstones. Fixes: #17241 Closes scylladb/scylladb#17242	2024-02-12 12:34:04 +02:00
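The behaviour described in the commit above can be modelled with a toy reader. This is a simplified sketch, not Scylla's query code: `read_page`, its row representation, and its return shape are hypothetical. A paged read cuts the page once the tombstone limit is reached, while an unpaged read ignores the limit and continues.

```python
def read_page(rows, tombstone_limit, paged):
    """Toy model of a read over rows that are either live or tombstones.

    rows: list of ("live", value) or ("tombstone", None) tuples.
    Returns (values, resume_index); resume_index is None when the read
    consumed all rows.
    """
    values = []
    tombstones = 0
    for index, (kind, value) in enumerate(rows):
        if kind == "tombstone":
            tombstones += 1
            # Only paged reads are cut short at the limit; unpaged reads
            # are allowed to continue past it.
            if paged and tombstones >= tombstone_limit:
                return values, index + 1
        else:
            values.append(value)
    return values, None


rows = [("tombstone", None)] * 3 + [("live", "x")]
print(read_page(rows, tombstone_limit=2, paged=True))
print(read_page(rows, tombstone_limit=2, paged=False))
```

The paged call returns an empty page plus a position to resume from, and the unpaged call reads through all three tombstones and returns the live value.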
Kefu Chai	9b85d1aebf	configure.py, cmake: do not pass -Wignored-qualifiers explicitly we recently added -Wextra to configure.py, and this option enables a bunch of warning options, including `-Wignored-qualifiers`. so there is no need to enable this specific warning anymore. this change removes this option from both `configure.py` and the CMake building system. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17272	2024-02-12 12:32:00 +02:00
Avi Kivity	c14571af16	Update seastar submodule Because Seastar now defaults to C++23, we downgrade it explicitly to C++20. * seastar 289ad5e593...5d3ee98073 (10): > Update supported C++ standards to C++23 and C++20 (dropping C++17) > docker: install clang-tools-18 > http: add handler_base::verify_mandatory_params() > coroutine/exception: document return_exception_ptr() > http: use structured-binding when appropriate > test/http: Read full server response before sending next > doc/lambda-coroutine-fiasco: fix a syntax error > util/source_location-compat: use __cpp_consteval > Fix incorrect class name in documentation. > Add support for missing HTTP PATCH method. Closes scylladb/scylladb#17268	2024-02-12 12:21:47 +02:00
Patryk Wrobel	9fccd968d3	test_tablets.py: implement test_tablet_count_metric_per_shard This change introduces a new test that verifies the functionality related to tablet_count metric. It checks if tablet_count metric is correctly reported and updated when new tables are created, when tables are dropped and when `move_tablet` is executed. Refs: scylladb#16131 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com> Closes scylladb/scylladb#17165	2024-02-12 11:49:38 +02:00
Kefu Chai	54995fcac0	test/manual: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17255	2024-02-12 11:49:38 +02:00
Asias He	a0e46a6b47	repair: Fix rpc::source and rpc::optional parameter order in rpc message In a mixed cluster (5.4.1-20231231.3d22f42cf9c3 and 5.5.0~dev-20240119.b1ba904c4977), in the rolling upgrade test, we saw repair never finishing. The following was observed: rpc - client 127.0.0.2:65273 msg_id 5524: caught exception while processing a message: std::out_of_range (deserialization buffer underflow) It turns out the repair rpc message was not compatible between the two versions. Even with a rpc stream verb, the new rpc parameters must come after the rpc::source<> parameter. The rpc::source<> parameter is not special in the sense that it must be the last parameter. For example, it should be: void register_repair_get_row_diff_with_rpc_stream( std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> ( const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_hash_with_cmd> source, rpc::optional<shard_id> dst_cpu_id_opt)>&& func); not: void register_repair_get_row_diff_with_rpc_stream( std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> ( const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::optional<shard_id> dst_cpu_id_opt, rpc::source<repair_hash_with_cmd> source)>&& func); Fixes #16941 Closes scylladb/scylladb#17156	2024-02-12 09:50:30 +02:00
Nadav Har'El	13e16475fa	cql-pytest: fix skipping of tests on Cassandra or old Scylla Recently we added a trick to allow running cql-pytests either with or without tablets. A single fixture test_keyspace uses two separate fixtures test_keyspace_tablets or test_keyspace_vnodes, as requested. The problem is that even if test_keyspace doesn't use its test_keyspace_tablets fixture (it doesn't, if the test isn't parameterized to ask for tablets explicitly), it's still a fixture, and it causes the test to be skipped. This causes every test to be skipped when running on Cassandra or old Scylla which doesn't support tablets. The fix is simple - the internal fixture test_keyspace_tablets should yield None instead of skipping. It is the caller, test_keyspace, which now skips the test if tablets are requested but test_keyspace_tablets is None. Fixes #17266 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17267	2024-02-11 21:03:25 +02:00
Kefu Chai	f990ea9678	tools/scylla-nodetool: implement describecluster Refs #15588 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17240	2024-02-11 20:21:07 +02:00
Avi Kivity	14bf09f447	Merge 'utils: managed_bytes: optimize memory usage for small buffers' from Michał Chojnowski managed_bytes is implemented as chain of blob_storage objects. Each blob_storage contains 24 bytes of metadata. But in the most common case -- when there is only a single element in the chain -- 16 bytes of this metadata is trivial/unused. This is regrettable waste because managed_bytes is used for every database cell in the memtables and cache. It means that every value of size >= 7 bytes (smaller ones fit in the inline storage of managed_bytes) receives 16 bytes of useless overhead. To correct that, this series adds to managed_bytes an alternative storage layout -- used for buffers small enough to fit in one fragment -- which only stores the necessary minimum of metadata. (That is: a pointer to the parent, to facilitate moving the storage during memory defragmentation). This saves 16 bytes on every cell greater than 15 bytes. Which includes e.g. every live cell with value bigger than 6 bytes, which likely applies to most cells. 
Before: ``` $ build/release/scylla perf-simple-query --duration 10 median 218692.88 tps ( 61.1 allocs/op, 13.1 tasks/op, 41762 insns/op, 0 errors) $ build/release/scylla perf-simple-query --duration 10 --write median 173511.46 tps ( 58.3 allocs/op, 13.2 tasks/op, 53258 insns/op, 0 errors) $ build/release/test/perf/mutation_footprint_test -c1 --row-count=20 --partition-count=100 --data-size=8 --column-count=16 - in cache: 2580222 - in memtable: 2549852 ``` After: ``` $ build/release/scylla perf-simple-query --duration 10 median 218780.89 tps ( 61.1 allocs/op, 13.1 tasks/op, 41763 insns/op, 0 errors) $ build/release/scylla perf-simple-query --duration 10 --write median 173105.78 tps ( 58.3 allocs/op, 13.2 tasks/op, 52913 insns/op, 0 errors) $ build/release/test/perf/mutation_footprint_test -c1 --row-count=20 --partition-count=100 --data-size=8 --column-count=16 - in cache: 2068238 - in memtable: 2037696 ``` Closes scylladb/scylladb#14263 * github.com:scylladb/scylladb: utils: managed_bytes: optimize memory usage for small buffers utils: managed_bytes: rewrite managed_bytes methods in terms of managed_bytes_view	2024-02-11 16:43:40 +02:00
Kefu Chai	cfb2c2c758	db: add formatter for gc_clock::time_point before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `gc_clock::time_point`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17254	2024-02-11 16:39:25 +02:00
Kefu Chai	33224cc10b	sstables/storage: avoid unnecessary type cast the type of `_dir` was changed to fs::path back in `637dd730`, there is no need to cast `_dir` to fs::path anymore. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17256	2024-02-11 16:37:05 +02:00
Benny Halevy	2ed29e31db	gms: inet_address: make constructors explicit In particular, `inet_address(const sstring& addr)` is dangerous, since a function like `topology::get_datacenter(inet_address ep)` might accidentally convert a `sstring` argument into an `inet_address` (which would most likely throw an obscure std::invalid_argument if the datacenter name does not look like an inet_address). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#17260	2024-02-11 15:44:13 +02:00
Benny Halevy	136df58cbc	data_value: delete data_value(T) constructor Currently, since the data_value(bool) ctor is implicit, pointers of any kind are implicitly convertible to data_value via intermediate conversion to `bool`. This is error prone, since it allows unsafe comparison between e.g. an `sstring` with `some` by implicit conversion of both sides to `data_value`. For example: ``` sstring name = "dc1"; struct X { sstring s; }; X x(name); auto p = &x; if (name == p) {} ``` Refs #17261 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#17262	2024-02-11 15:42:55 +02:00
Benny Halevy	f86a5072d6	gossiper: add_local_application_state: drop internae error After `1d07a596bf` that dropped before_change notifications there is no sense in getting the local endpoint_state_ptr twice: before and after the notifications and call on_internal_error if the state isn't found after the notifications. Just throw the runtime_error if the endpoint state is not found, otherwise, use it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-02-11 13:33:26 +02:00
Benny Halevy	ac83df4875	transport: controller: do_start_server: do not set_cql_read for maintenance port RPC is not ready yet at this point, so we should not set this application state yet. This is indicated by the following warning from `gossiper::add_local_application_state`: ``` WARN 2024-01-22 23:40:53,978 [shard 0:stmt] gossip - Fail to apply application_state: std::runtime_error (endpoint_state_map does not contain endpoint = 127.227.191.13, application_states = {{RPC_READY -> Value(1,1)}}) ``` That should really be an internal error, but it can't because of this bug. Fixes #16932 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-02-11 11:49:52 +02:00
Kefu Chai	d7a404e1ec	alternator: add formatter for alternator::calculate_value_caller before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `alternator::calculate_value_caller`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17259	2024-02-11 11:49:46 +02:00
Michał Chojnowski	5a3e4a1cc0	utils: managed_bytes: optimize memory usage for small buffers managed_bytes is implemented as chain of blob_storage objects. Each blob_storage contains 24 bytes of metadata. But in the most common case -- when there is only a single element in the chain -- 16 bytes of this metadata is trivial/unused. This is regrettable waste because managed_bytes is used for every database cell in the memtables and cache. It means that every value of size >= 7 bytes (smaller ones fit in the inline storage of managed_bytes) receives 16 bytes of useless overhead. To correct that, this patch adds to managed_bytes an alternative storage layout -- used for buffers small enough to fit in one contiguous fragment -- which only stores the necessary minimum of metadata. (That is: a pointer to the parent, to facilitate moving the storage during memory defragmentation).	2024-02-09 20:56:20 +01:00
Tomasz Grabiec	1eedc85990	test: py: tablets: Fix flakiness of test_tablet_missing_data_repair Reimplement stop/start sequence using rolling_restart() which is safe with regards to UP status propagation and not prone to sudden connection drop which may cause later CQL queries to time out. It also ensures that CQL is up on all the remaining nodes when the with_down callback is executed. Hopefully: Fixes #17107	2024-02-09 20:37:06 +01:00
Tomasz Grabiec	27ed2d94fc	test: pylib: manager_client: Wait for driver to catch up in rolling_restart() For sanity of the developers who want to execute CQL queries after rolling restarts.	2024-02-09 20:35:41 +01:00
Tomasz Grabiec	3ce4ec796a	test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down	2024-02-09 20:35:41 +01:00
Pavel Emelyanov	7a710425f0	streaming: Open-code on-stack lambda It just wraps one if, no benefit in keeping it this way Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17250	2024-02-09 20:31:09 +01:00
Petr Gusev	4554653ad9	storage_proxy: add a test for stop_remote This patch adds a reproducer test for an issue #16382. See scylladb/seastar#2044 for details of the problem. The test is enabled only in dev mode since it requires error injection mechanism. The patch adds a new injection into storage_proxy::handle_read to simulate the problem scenario - the node is shutting down and there are some unfinished pending replica requests. Closes scylladb/scylladb#16776	2024-02-09 17:23:13 +01:00
Michał Chojnowski	277a31f0ae	utils: managed_bytes: rewrite managed_bytes methods in terms of managed_bytes_view Some methods of managed_bytes contain the logic needed to read/write the contents of managed_bytes, even though this logic is already present in managed_bytes_{,mutable}_view. Reimplementing those methods by using the views as intermediates allows us to remove some code and makes the responsibilities cleaner -- after the change, managed_bytes contains the logic of allocating and freeing the storage, while views provide read/write access to the storage. This change will simplify the next patch which changes the internals of managed_bytes.	2024-02-09 17:00:33 +01:00
Botond Dénes	ba89b86913	Update tools/java submodule * tools/java c75ce2c1...5e11ed17 (1): > bin/nodetool-wrapper: pass all args to nodetool for testings its ability	2024-02-09 16:34:47 +01:00
Raphael S. Carvalho	daa82f406c	test_tablets: Enable table debug log in split test If the test fails, it's helpful to see how split completion was handled. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17236	2024-02-09 14:38:24 +02:00
Botond Dénes	c7d9708092	Merge 'repair: delete table reference from repair related classes' from Aleksandra Martyniuk row_level_repair and repair_meta keep a reference to a table. If the table is dropped during repair, its object is destructed, leaving a dangling reference. Delete {row_level_repair,repair_meta}::_cf and replace their usages. Fixes: #17233. Closes scylladb/scylladb#17234 * github.com:scylladb/scylladb: repair: delete _cf from repair_meta repair: delete _cf from row_level_repair	2024-02-09 13:16:43 +02:00
Kamil Braun	e9e24f47ec	Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski This PR implements a procedure that upgrades existing clusters to use raft-based topology operations. The procedure does not start automatically, it must be triggered manually by the administrator after making sure that no topology operations are currently running. Upgrade is triggered by sending `POST /storage_service/raft_topology/upgrade` request. This causes the topology coordinator to start who drives the rest of the process: it builds the `system.topology` state based on information observed in gossip and tells all nodes to switch to raft mode. Then, topology coordinator runs normally. Upgrade progress is tracked in a new static column `upgrade_state` in `system.topology`. The procedure also serves as an extension to the current recovery procedure on raft. The current recovery procedure requires restarting nodes in a special mode which disables raft, perform `nodetool removenode` on the dead nodes, clean up some state on the nodes and restart them so that they automatically rebuild the group 0. Raft topology fits into existing procedure by falling back to legacy topology operations after disabling raft. After rebuilding the group 0, upgrade needs to be triggered again. Because upgrade is manual and it might not be convenient for administrators to run it right after upgrading the cluster, we allow the cluster to operate in legacy topology operations mode until upgrade, which includes allowing new nodes to join. In order to allow it, nodes now ask the cluster about the mode they should use to join before proceeding by using a new `JOIN_NODE_QUERY` RPC. The procedure is explained in more detail in `topology-over-raft.md`. 
Fixes: https://github.com/scylladb/scylladb/issues/15008 Closes scylladb/scylladb#17077 * github.com:scylladb/scylladb: test/topology_custom: upgrade/recovery tests for topology on raft cdc/generation_service: in legacy mode, fall back to raft tables system_keyspace: add read_cdc_generation_opt cdc/generation_service: turn off gossip notifications in raft topo mode cql_test_env: move raft_topology_change_enabled var earlier group0_state_machine: pull snapshot after raft topology feature enabled storage_service: disable persistent feature enabler on upgrade storage_service: replicate raft features to system.peers storage_service: gossip tokens and cdc generation in raft topology mode API: add api for triggering and monitoring topology-on-raft upgrade storage_service: infer which topology operations to use on startup storage_service: set the topology kind value based on group 0 state raft_group0: expose link to the upgrade doc in the header feature_service: fall back to checking legacy features on startup storage_service: add fiber for tracking the topology upgrade progress gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES topology_coordinator: implement core upgrade logic topology_coordinator: extract top-level error handling logic storage_service: initialize discovery leader's state earlier topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data topology_state_machine: introduce upgrade_state storage_service: disallow topology ops when upgrade is in progress raft_group0_client: add in_recovery method storage_service: introduce join_node_query verb raft_group0: make discover_group0 public raft_group0: filter current node's IP in discover_group0 raft_group0: remove my_id arg from discover_group0 storage_service: make _raft_topology_change_enabled 
more advanced docs: document raft topology upgrade and recovery	2024-02-09 11:54:53 +01:00
Kefu Chai	c1c96bbc16	api/storage_service: drop /storage_service/describe_ring/ API per its description, "`/storage_service/describe_ring/`" returns the token ranges of an arbitrary keyspace. actually, it returns the first keyspace which is of non-local-vnode-based-strategy. this API is not used by nodetool, neither is it exercised in dtest. scylla-manager has a wrapper for this API though, but that wrapper is not used anywhere. in this change, this API is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17197	2024-02-09 12:49:21 +02:00
Kefu Chai	c07de1fad1	topology_coordinator: s/sate/state/ fix a typo in the logging message. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17201	2024-02-09 10:27:33 +01:00
Kefu Chai	876478b84f	storage_service: allow concurrent tablet migration in tablets/move API Currently it waits for topology state machine to be idle, so it allows one tablet to be moved at a time. We should allow it to start migration if the current transition state is - topology::transition_state::tablet_migration or - topology::transition_state::tablet_draining to allow starting parallel tablet movement. That will be useful when scripting a custom rebalancing algorithm. in this change, we wait until the topology state machine is idle or it is at either of the above two states. Fixes #16437 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17203	2024-02-08 21:47:15 +01:00
Piotr Dulikowski	4d4976feb0	test/topology_custom: upgrade/recovery tests for topology on raft Adds three tests for the new upgrade procedure: - test_topology_upgrade - upgrades a cluster operating in legacy mode to use raft topology operations, - test_topology_recovery_basic - performs recovery on a three-node cluster, no node removal is done, - test_topology_majority_loss - simulates a majority loss scenario, i.e. removes two nodes out of three, performs recovery to rebuild the raft topology state and re-adds two nodes.	2024-02-08 19:12:28 +01:00
Piotr Dulikowski	d04b3338ce	cdc/generation_service: in legacy mode, fall back to raft tables When a node enters recovery after being in raft topology mode, topology operations switch back to legacy mode. We want CDC to keep working when that happens, so we need for the legacy code to be able to access generations created back in raft mode - so that the node can still properly serve writes to CDC log tables. In order to make this possible, modify the legacy logic to also look for a cdc generation in raft tables, if it is not found in legacy tables.	2024-02-08 19:12:28 +01:00
Piotr Dulikowski	fb02453686	system_keyspace: add read_cdc_generation_opt The `system_keyspace::read_cdc_generation` loads a cdc generation from the system tables. One of its preconditions is that the generation exists - this precondition is quite easy to satisfy in raft mode, and the function was designed to be used solely in that mode. In legacy mode however, in case when we revert from raft mode through recovery, it might be necessary to use generations created in raft mode for some time. In order to make the function useful as a fallback in case lookup of a generation in legacy mode fails, introduce a relaxed variant of `read_cdc_generation` which returns std::nullopt if the generation does not exist.	2024-02-08 19:12:28 +01:00
Piotr Dulikowski	77a8f5e3d6	cdc/generation_service: turn off gossip notifications in raft topo mode In raft topology mode CDC information is propagated through group 0. Prevent the generation service from reacting to gossiper notifications after we made the switch to raft mode.	2024-02-08 19:12:28 +01:00

41226 Commits