scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	862b6e61a4	topology_coordinator: handle joining nodes The topology coordinator is updated to perform verification of joining nodes and to send `JOIN_NODE_RESPONSE` RPC back to the joining node.	2023-09-27 15:53:15 +02:00
Piotr Dulikowski	5ba2bfa015	topology_state_machine: add join_group0 state Currently, when the topology coordinator notices a request to join or replace a node, the node is transitioned to an appropriate state and the topology is moved to commit_new_generation/write_both_read_old, in a single group 0 operation. In later commits, the topology coordinator will accept/reject nodes based on the request, so we would like to have a separate step - topology coordinator accepts, transitions to bootstrap state, tells the node that it is accepted, and only then continues with the topology transition. This commit adds a new `join_group0` transition state that precedes `commit_cdc_generation`.	2023-09-27 15:53:15 +02:00
Piotr Dulikowski	bb40c2a8b8	storage_service: add join node RPC handlers	2023-09-27 15:53:13 +02:00
Piotr Dulikowski	64668e325e	raft: expose current_leader in raft::server The handler for join_node_request will need to know which node is considered the group 0 leader right now by the local node. If the topology coordinator crashes and a new node immediately wants to replace it with the same IP, the node that handles join_node_request will attempt to perform a read barrier. If this happens quickly enough, due to the IP reuse the RPC will be sent to the new node instead of the (now crashed) topology coordinator; the RPC will get an error and will fail the barrier. If we detect that the new node wants to replace the current topology coordinator, the upcoming join_node_request_handler will wait until there is a leader change.	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	74b01730b4	storage_service: extract wait_for_live_nodes_timeout constant Like in the non-raft topology path, during the new handshake, the joining node will wait until all normal nodes are alive. The timeout used during the wait is extracted to a constant so that it will be reused in the handshake code, to be introduced in later commits.	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	4f82f9fe50	raft_group0: abstract out node joining handshake Currently, the raft_group0 uses GROUP0_MODIFY_CONFIG RPC to ask an existing group 0 member to add this node to the group, in case the joining node was not a discovery leader. The new handshake verbs (JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a preparation, this commit abstracts away the handshake process.	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	c24daf7e88	storage_service: pass raft_topology_change_enabled on rpc init We will want to conditionally register some verbs based on whether we are using raft topology or not. This commit serves as a preparation, passing the `raft_topology_change_enabled` to the function which initializes the verbs (although there is _raft_topology_change_enabled field already, it's only initialized on shard 0 later).	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	7cbe5e3af8	rpc: add new join handshake verbs The `join_node_request` and `join_node_response` RPCs are added: - `join_node_request` is sent from the joining node to any node in the cluster. It contains some initial parameters that will be verified by the receiving node, or the topology coordinator - notably, it contains a list of cluster features supported by the joining node. - `join_node_response` is sent from the topology coordinator to the joining node to tell it about the outcome of the verification.	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	dd4579637b	docs: document the new join procedure	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	caf1d4938e	topology_state_machine: add supported_features to replica_state The `service::topology_features` struct was introduced in #14955. Its purpose was to make it possible to load cluster features from `system.topology` before schema commitlog replay. It contains a map from host ID to supported feature set for every normal node. In order not to duplicate logic for loading features, the `service::topology`'s `replica_state`s do not hold a set of supported features and users are supposed to refer to the features in `topology_features`, which is a field in the `topology` struct. However, accessing features is quite awkward now. This commit adds `supported_features` field back to the `replica_state` struct and the `load_topology_state` function initializes them properly. The logic duplication needed to initialize them is quite small and the drawbacks that come with it are outweighed by the fact that we now can refer to node's supported features in a more natural way. The `topology_features` struct is no longer a field of `topology`, but it still exists for the purpose of the feature check that happens before commitlog replay.	2023-09-26 15:56:52 +02:00
Piotr Dulikowski	51b0e4d44f	storage_service: check destination host ID in raft verbs In unlucky but possible circumstances where a node is being replaced very quickly, RPC requests using raft-related verbs from storage_service might be sent to it - even before the node starts its group 0 server. In the latter case, this triggers on_internal_error. This commit adds protection to the existing verbs in storage_service: they check whether the group 0 is running and whether the received host_id matches the actual recipient's host_id. None of the verbs that are modified are in any existing release, so the added parameter does not have to be wrapped in rpc::optional.	2023-09-26 15:56:51 +02:00
Piotr Dulikowski	0317705f5a	group_state_machine: take reference to raft address map It will be needed to translate host ids to addresses.	2023-09-26 15:46:25 +02:00
Piotr Dulikowski	193e8eba26	raft_group0: expose joined_group0 It will be needed in the next commit to check whether the group 0 server has been started.	2023-09-26 15:46:25 +02:00
Tomasz Grabiec	0f22e8d196	storage_service: Fixed missed notification on tablet metadata update There can be 2 waiters now (coordinator and CDC generation publisher), so signal() is not enough. The change made in `c416c9ff33` did not update this site. Closes scylladb/scylladb#15527	2023-09-26 10:37:57 +02:00
Jan Ciolek	e5f0468761	cql/prepare_expr: fix wrong receiver in field_selection_test_assignment When preparing a `field_selection`, we need to prepare the UDT value, and then verify that it has this field. `field_selection_test_assignment` prepares the UDT value using the same receiver as the whole `field_selection`. This is wrong, this receiver has the type of the field, and not the UDT. It's impossible to create a receiver for the UDT. Many different UDTs can produce an `int` value when the field `a` is selected. Therefore the receiver should be `nullptr`. No unit test is added, as this bug doesn't currently cause any issues. Preparing a column value doesn't do any type checks, so nothing fails. Still it's good to fix it, just to be correct. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes scylladb/scylladb#14788	2023-09-26 11:15:00 +03:00
Tomasz Grabiec	19ff4b730f	storage_service: Avoid SIGSEGV when tablet cleanup is invoked on non-0 shard We access group0, which is only set on shard 0. Closes scylladb/scylladb#15469	2023-09-25 20:59:27 +03:00
Pavel Emelyanov	901bbf21e9	Merge 'build: extract code fragments into functions' from Kefu Chai more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to CMake based building system. Refs https://github.com/scylladb/scylladb/issues/15379 Closes scylladb/scylladb#15515 * github.com:scylladb/scylladb: build: extract get_os_ids() out build: extract find_ninja() out build: extract thrift_uses_boost_share_ptr() out	2023-09-25 20:57:59 +03:00
Botond Dénes	caeddb9c88	tools/utils: return a distinct error-code on unknown operation Currently, the tools loosely follow the following convention on error-codes: * return 1 if the error is with any of the command-line arguments * return 2 on other errors This patch changes the returned error-code on unknown operation/command to 100 (instead of the previous 1). The intent is to allow any wrapper script to determine that the tool failed because the operation is unrecognized and not because of something else. In particular this should enable us to write a wrapper script for scylla-nodetool, which dispatches commands still un-implemented in scylla-nodetool, to the java nodetool. Note that the tool will still print an error message on an unknown operation. So such wrapper script would have to make sure to not let this bleed-through when it decides to forward the operation. Closes scylladb/scylladb#15517	2023-09-25 20:56:44 +03:00
Anna Stuchlik	4afe2b9d9f	doc: add RBNO to glossary This commit adds Repair Based Node Operations to the ScyllaDB glossary. Fixes https://github.com/scylladb/scylladb/issues/11959 Closes scylladb/scylladb#15522	2023-09-25 18:16:53 +03:00
Pavel Emelyanov	652153c291	Merge 'populate_keyspace: use datadir' from Benny Halevy Currently the datadir is ignored. Use it to construct the table's base path. Fixes scylladb/scylladb#15418 Closes scylladb/scylladb#15480 * github.com:scylladb/scylladb: distributed_loader: populate_keyspace: access cf by ref distributed_loader: table_populator: use datadir for base_path distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed distributed_loader: populate_keyspace: fixup indentation distributed_loader: populate_keyspace: iterate over datadirs in the inner loop test: sstable_directory_test: add test_multiple_data_dirs table: init_storage: create upload and staging subdirs on all datadirs	2023-09-25 13:40:50 +03:00
Nadav Har'El	1a5debac5c	test/cql-pytest: cleaner reproducer for spurious static row returned Issue #10357 is about a SELECT with a filter on a regular column which incorrectly returns a static row without regular columns set (so the filter would not have matched). We already have four tests reproducing this issue, but each of them is a small part of a large test translated from Cassandra, making it hard to understand the scope of this bug. So in this patch we add two new tests, one passing and one xfailing, which clarify the scope of this bug. It turns out that the bug only occurs when a partition has no clustering rows and only has a static row. If the partition does have clustering rows - even if those don't match the filter - the bug doesn't happen. The xfailing test is just two statements long - a single INSERT and a single SELECT. Refs #10357. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#15120	2023-09-25 11:01:22 +03:00
Raphael S. Carvalho	914cbc11cf	reader_concurrency_semaphore: Fix stop() in face of evictable reads becoming inactive Scylla can crash due to a complicated interaction of service level drop, evictable readers, and the inactive read registration path. 1) service level drop invokes stop of the reader concurrency semaphore, which will wait for in-flight requests 2) it turns out that it first stops the gate used for closing readers that will become inactive. 3) it proceeds to wait for in-flight reads by closing the reader permit gate. 4) one of the evictable reads takes the inactive read registration path, and finds the gate for closing readers closed. 5) the flat mutation reader is destroyed, but finds that the underlying reader was not closed gracefully and triggers the abort. By closing the permit gate first, evictable readers becoming inactive will be able to properly close the underlying reader, therefore avoiding the crash. Fixes #15534. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#15535	2023-09-25 08:55:50 +03:00
Nadav Har'El	be942c1bce	Merge 'treewide: rename s3 credentials related variable and option names' from Kefu Chai in this series, we rename s3 credential related variable and option names so they are more consistent with AWS's official document. this should help with the maintainability. Closes scylladb/scylladb#15529 * github.com:scylladb/scylladb: main.cc: rename aws option utils/s3/creds: rename aws_config member variables	2023-09-24 14:03:47 +03:00
Nadav Har'El	4e1e7568d8	Merge 'cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe' from Michał Jadwiszczak So far generic describe (`DESC <name>`) followed Cassandra implementation and it only described keyspace/table/view/index. This commit adds UDT/UDF/UDA to generic describe. Fixes: #14170 Closes scylladb/scylladb#14334 * github.com:scylladb/scylladb: docs:cql: add information about generic describe cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe	2023-09-24 13:03:04 +03:00
Kefu Chai	f3f31f0c65	main.cc: rename aws option - s/aws_key/aws_access_key_id/ - s/aws_secret/aws_secret_access_key/ - s/aws_token/aws_session_token/ rename them to more popular names, these names are also used by boto's API. this should improve the readability and consistency. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-09-23 14:31:32 +08:00
Kefu Chai	ac3406e537	utils/s3/creds: rename aws_config member variables - s/key/access_key_id/ - s/secret/secret_access_key/ - s/token/session_token/ so they are more aligned with the AWS document. for instance, in https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#ConstructingTheAuthenticationHeader AWSAccessKeyId is used in the "Authorization" header. this would help with the readability and maintainability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-09-23 14:28:07 +08:00
Benny Halevy	7bd131d212	distributed_loader: populate_keyspace: access cf by ref There is no need to hold on to the table's shared ptr since it's held by the global table ptr we got in the outer loop. Simplify the code by just getting the local table reference from `gtable`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:51:41 +03:00
Benny Halevy	a8e7981bb6	distributed_loader: table_populator: use datadir for base_path Currently the datadir is ignored. Use it to construct the table's base path. Fixes scylladb/scylladb#15418 Note that scylla still doesn't work correctly with multiple data directories due to #15510. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:51:39 +03:00
Benny Halevy	14da3e4218	distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed Currently, mark_ready_for_writes is called too early, after the first data dir is processed, then the next datadir will hit an assert in `table::mark_ready_for_writes`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:50:53 +03:00
Benny Halevy	84510370e1	distributed_loader: populate_keyspace: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:50:52 +03:00
Benny Halevy	87d438b234	distributed_loader: populate_keyspace: iterate over datadirs in the inner loop It is more efficient to iterate over multiple data directories in the inner loop rather than the outer loop. Following patch will make use of the datadir in table_populator. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:50:24 +03:00
Benny Halevy	2591f5f935	test: sstable_directory_test: add test_multiple_data_dirs Add a basic regression test that starts the cql test env with multiple data directories. It fails without the previous patch: table: init_storage: create upload and staging subdirs on all datadirs Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:24:54 +03:00
Benny Halevy	2937552e5b	table: init_storage: create upload and staging subdirs on all datadirs We need to have a complete directory structure for each table and each configured datadir. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-09-23 08:24:54 +03:00
David Garcia	762ca61ad9	docs: format db reference as list docs: limit reference max_depth docs: change reference description order Closes scylladb/scylladb#15205	2023-09-22 19:25:01 +03:00
Kamil Braun	99d83808cc	Merge 'test/topology_custom/test_select_from_mutation_fragments.py: use async api and clean-up' from Botond Dénes Also, while at it, add copyright/license blurbs for tests that were missing it. Closes scylladb/scylladb#15495 * github.com:scylladb/scylladb: test/topology_custom: add copyright/license blurb to tests test/topology_custom: test_select_from_mutation_fragments.py: use async query api	2023-09-22 10:59:48 +02:00
Botond Dénes	91a8100b3f	Merge 'Validate compaction strategy options in prepare' from Aleksandra Martyniuk Table properties validation is performed on statement execution. Thus, when one attempts to create a table with invalid options, an incorrect command gets committed in Raft. But then its application fails, leading to a raft machine being stopped. Check table properties when create and alter statements are prepared. Fixes: #14710. Closes scylladb/scylladb#15091 * github.com:scylladb/scylladb: cql3: statements: delete execute override cql3: statements: call check_restricted_table_properties in prepare cql3: statements: pass data_dictionary::database to check_restricted_table_properties	2023-09-22 09:49:19 +03:00
Kefu Chai	be7363a621	build: extract get_os_ids() out this helper is only used by pkgname(), so move it closer to its sole caller. Refs #15379 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-09-22 14:12:12 +08:00
Kefu Chai	0af50b2709	build: extract find_ninja() out more structured this way. and the data dependency is more clear with this change. this also allows us to quickly identify the parts which should/can be reused when migrating to the CMake based building system. Refs #15379 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-09-22 13:23:45 +08:00
Kefu Chai	2e901bae2f	build: extract thrift_uses_boost_share_ptr() out more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to CMake based building system. Refs #15379 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-09-22 13:23:45 +08:00
Michael Huang	a684e51e4d	cql3: fix bad optional access when executing fromJson function Fix fromJson(null) to return null, not an error as it did before this patch. We use "null" as the default value when unwrapping optionals to avoid bad optional access errors. Fixes: scylladb#7912 Signed-off-by: Michael Huang <michaelhly@gmail.com> Closes scylladb/scylladb#15481	2023-09-21 20:18:49 +03:00
Avi Kivity	61440d20c3	Merge 'Enable incremental compaction on off-strategy' from Raphael "Raph" Carvalho Off-strategy suffers from a 100% space overhead, as it adopted a sort of all or nothing approach. Meaning all input sstables, living in maintenance set, are kept alive until they're all reshaped according to the strategy criteria. Input sstables in off-strategy are very likely to be mostly disjoint, so it can greatly benefit from incremental compaction. The incremental compaction approach is not only good for decreasing disk usage, but also memory usage (as metadata of input and output live in memory), and file desc count, which takes memory away from OS. Turns out that this approach also greatly simplifies the off-strategy impl in compaction manager, as it no longer has to maintain new unused sstables and mark them for deletion on failure, and also unlink intermediary sstables used between reshape rounds. Fixes https://github.com/scylladb/scylladb/issues/14992. Closes scylladb/scylladb#15400 * github.com:scylladb/scylladb: test: Verify that off-strategy can do incremental compaction compaction: Clear pending_replacement list when tombstone GC is disabled compaction: Enable incremental compaction on off-strategy compaction: Extend reshape type to allow for incremental compaction compaction: Move reshape_compaction in the source compaction: Enable incremental compaction only if replacer callback is engaged	2023-09-21 20:12:19 +03:00
Gleb Natapov	c94a9cf731	storage_service: raft topology: fence off write from old topology coordinator before starting a new one Make sure that all writes started by the old coordinator are completed or will eventually fail before starting a new coordinator. Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>	2023-09-21 17:26:45 +02:00
Avi Kivity	1da6a939fe	Merge 'Track memory usage of S3 object uploads' from Pavel Emelyanov The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5Mb. When the necessary amount of bytes is accumulated, the part-uploading fiber starts in the background. On flush the sink waits for all the fibers to complete and handles failure of any. Uploading parallelism is nowadays limited by the means of the http client max-connections parameter. However, when a part-uploading fiber waits for its connection it keeps the 5Mb+ buffers on the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not. This PR adds a shard-wide limiter on the number of background buffers S3 clients (and their http clients) may use. Closes scylladb/scylladb#15497 * github.com:scylladb/scylladb: s3::client: Track memory in client uploads code: Configure s3 clients' memory usage s3::client: Construct client with shared semaphore sstables::storage_manager: Introduce config	2023-09-21 18:24:42 +03:00
Botond Dénes	a0c5dee2aa	utils/logalloc: introduce logalloc::bad_alloc This new exception type inherits from std::bad_alloc and allows logalloc code to add additional information about why the allocation failed. We currently have 3 different throw sites for std::bad_alloc in logalloc.cc and when investigating a coredump produced by --abort-on-lsa-bad-alloc, it is impossible to determine, which throw-site activated last, triggering the abort. This patch fixes that by disambiguating the throw-sites and including it in the error message printed, right before abort. Refs: #15373 Closes scylladb/scylladb#15503	2023-09-21 17:43:53 +03:00
Raphael S. Carvalho	91efd878d7	test: Verify that off-strategy can do incremental compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-21 11:15:46 -03:00
Raphael S. Carvalho	9d92374b20	compaction: Clear pending_replacement list when tombstone GC is disabled pending_replacement list is used by incremental compaction to communicate to other ongoing compactions about exhausted sstables that must be replaced in the sstable set they keep for tombstone GC purposes. Reshape doesn't enable tombstone GC, so that list will not be cleared, which prevents incremental compaction from releasing sstables referenced by that list. It's not a problem until now where we want reshape to do incremental compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-21 11:15:46 -03:00
Raphael S. Carvalho	42050f13a0	compaction: Enable incremental compaction on off-strategy Off-strategy suffers from a 100% space overhead, as it adopted a sort of all or nothing approach. Meaning all input sstables, living in maintenance set, are kept alive until they're all reshaped according to the strategy criteria. Input sstables in off-strategy are very likely to be mostly disjoint, so it can greatly benefit from incremental compaction. The incremental compaction approach is not only good for decreasing disk usage, but also memory usage (as metadata of input and output live in memory), and file desc count, which takes memory away from OS. Turns out that this approach also greatly simplifies the off-strategy impl in compaction manager, as it no longer has to maintain new unused sstables and mark them for deletion on failure, and also unlink intermediary sstables used between reshape rounds. Fixes #14992. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-21 11:15:46 -03:00
Raphael S. Carvalho	db9ce9f35a	compaction: Extend reshape type to allow for incremental compaction That's done by inheriting regular_compaction, which implement incremental compaction. But reshape still implements its own methods for creating writer and reader. One reason is that reshape is not driven by controller, as input sstables to it live in maintenance set. Another reason is customization of things like sstable origin, etc. stop_sstable_writer() is extended because that's used by regular_compaction to check for possibility of removing exhausted sstables earlier whenever an output sstable is sealed. Also, incremental compaction will be unconditionally enabled for ICS/LCS during off-strategy. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-21 11:15:12 -03:00
Raphael S. Carvalho	33a0f42304	compaction: Move reshape_compaction in the source That's in preparation to next change that will make reshape inherit from regular compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-21 11:11:13 -03:00
Botond Dénes	3b95f4f107	Merge 'Sanitize view-update-generator start-stop sequence' from Pavel Emelyanov The v.u.g. start stop is now spread over main() code heavily. 1. sharded<v.u.g.>.start() happens early enough to allow depending services register staging sstables on it 2. after the system is "more-or-less" alive the invoke_on_all(v.u.g.::start()) is called (conditionally) to activate the generator background fiber. Not 100% sure why it happens _that_ late, but somehow it's required that while scylla is joining the cluster the generation doesn't happen 3. early on stop the v.u.g. is fully stopped The 3rd step is pretty nasty. It may happen that v.u.g. is not stopped if scylla start aborts before the last action is defer-scheduled. Also, when it happens, it leaves stopping dependencies with non-initialized v.u.g.'s local instances, which is not symmetrical to how they start. Said that, this PR fixes the stopping sequence to happen later, i.e. -- being defer-scheduled right after sharded<v.u.g.> is started. Also it makes sure that terminating the background fiber happens as early as it is now. This is done the compaction_manager-style -- the v.u.g. subscribes on stop signal abort source and kicks the fiber to stop when it fires. Closes scylladb/scylladb#15466 * github.com:scylladb/scylladb: view_update_generator: Stop for real later view_update_generator: Add logging to do_abort() view_update_generator: Move abort kicking to do_abort() view_update_generator: Add early abort subscription	2023-09-21 17:01:27 +03:00
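
The exit-code convention described in commit caeddb9c88 (return 100 on an unknown operation so that a wrapper can forward the call to another tool) could be used roughly as in the sketch below. This is only an illustration under stated assumptions: `run_with_fallback`, the `new-tool`/`old-tool` command names, and the injectable `run` parameter are hypothetical and are not taken from the ScyllaDB tree; only the value 100 comes from the commit message.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Exit code the commit message assigns to an unrecognized operation.
UNKNOWN_OPERATION = 100


def run_with_fallback(
    primary: Sequence[str],
    fallback: Sequence[str],
    args: Sequence[str],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run `primary args`; if it reports an unknown operation, run `fallback args`.

    `run` is injectable (hypothetical design choice) so the dispatch logic
    can be exercised without real executables.
    """
    # Capture the primary tool's output so its "unknown operation" message
    # does not bleed through when the call is forwarded.
    first = run([*primary, *args], capture_output=True, text=True)
    if first.returncode == UNKNOWN_OPERATION:
        # Discard the captured output and let the fallback tool handle the call.
        return run([*fallback, *args]).returncode
    # Any other outcome belongs to the primary tool: replay what it printed.
    sys.stdout.write(first.stdout)
    sys.stderr.write(first.stderr)
    return first.returncode
```

Because the primary tool's output is captured and replayed only after it exits, this sketch is not suitable for interactive or streaming operations; a real wrapper would likely need a different approach there.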


39071 Commits