scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-08 16:03:20 +00:00

Author	SHA1	Message	Date
Kefu Chai	a5e696fab8	storage_service, test: drop unused storage_service_config this setting was removed back in `dcdd207349`, so despite that we are still passing `storage_service_config` to the ctor of `storage_service`, `storage_service::storage_service()` just drops it on the floor. in this change, `storage_service_config` class is removed, and all places referencing it are updated accordingly. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes #11415	2022-08-31 19:49:13 +03:00
Avi Kivity	421557b40a	Merge "Provide DC/RACK when populating topology" from Pavel E " The topology object maintains all sorts of node/DC/RACK mappings on board. When new entries are added to it the DC and RACK are taken from the global snitch instance which, in turn, checks gossiper, system keyspace and its local caches. This set makes the topology population API require DC and RACK via the call argument. In most of the cases the populating code is the storage service that knows exactly where to get those from. After this set it will be possible to remove the dependency knot consisting of snitch, gossiper, system keyspace and messaging. " * 'br-topology-dc-rack-info' of https://github.com/xemul/scylla: toplogy: Use the provided dc/rack info test: Provide testing dc/rack infos storage_service: Provide dc/rack for snitch reconfiguration storage_service: Provide dc/rack from system ks on start storage_service: Provide dc/rack from gossiper for replacement storage_service: Provide dc/rack from gossiper for remotes storage_service,dht,repair: Provide local dc/rack from system ks system_keyspace: Cache local dc-rack on .start() topology: Some renames after previous patch topology: Require entry in the map for update_normal_tokens() topology: Make update_endpoint() accept dc-rack info replication_strategy: Accept dc-rack as get_pending_address_ranges argument dht: Carry dc-rack over boot_strapper and range_streamer storage_service: Make replacement info a real struct	2022-08-31 12:53:06 +03:00
Kamil Braun	6c16ae4868	Merge 'raft, limit for command size' from Gusev Petr Commitlog imposes a limit on the size of mutations and throws an exception if it's exceeded. In case of schema changes before raft this exception was delivered to the client. Now it happens while saving the raft command in io_fiber in persistence->store_log_entries and what the client gets is just a timeout exception, which doesn't say much about the cause of the problem. This patch introduces an explicit command size limit and provides a clear error message in this case. Closes #11318 * github.com:scylladb/scylladb: raft, use max_command_size to satisfy commitlog limit raft, limit for command size	2022-08-26 12:20:58 +02:00
Pavel Emelyanov	f6abc3f759	storage_service: Provide dc/rack for snitch reconfiguration When snitch reconfigures (gossiper-property-file one) it kicks storage service so that it updates itself. This place also needs to update the dc/rack info about itself, the correct (new) values are taken from the snitch itself. There's a bug here -- system.local table is not updated with new data until restart. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:58:34 +03:00
Pavel Emelyanov	f8614fe039	storage_service: Provide dc/rack from system ks on start When a node starts it loads the information about peers from system.peers table and populates token metadata and topology with this information. The dc/rack are taken from the sys-ks cache here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:57:15 +03:00
Pavel Emelyanov	5d5782a086	storage_service: Provide dc/rack from gossiper for replacement When a node is started to replace another node it updates token metadata and topology with the target information early. The tokens are now taken from gossiper shadow round, this patch does the same for dc/rack info. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:55:31 +03:00
Pavel Emelyanov	6b70358616	storage_service: Provide dc/rack from gossiper for remotes When a node is notified about other nodes state change it may want to update the topology information about it. In all those places the dc/rack info about the peer is provided by the gossiper. Basically, these updates mirror the relevant updates of tokens on the token metadata object. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:53:54 +03:00
Pavel Emelyanov	43e83c5415	storage_service,dht,repair: Provide local dc/rack from system ks When a node starts it adds itself to the topology. Mostly it's done in the storage_service::join_cluster() and whoever it calls. In all those places the dc/rack for the added node is taken from the system keyspace (its cache was populated with local dc/rack by the previous patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:52:16 +03:00
Pavel Emelyanov	4cbe6ee9f4	topology: Require entry in the map for update_normal_tokens() The method in question tries to be on the safest side and adds the endpoint for which it updates the tokens into the topology. From now on it's up to the caller to put the endpoint into topology in advance. So most of what this patch does is place topology.update_endpoint() into the relevant places of the code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:44:08 +03:00
Pavel Emelyanov	5fc9854eae	topology: Make update_endpoint() accept dc-rack info The method in question populates topology's internal maps with endpoint vs dc/rack relations. As of today the dc/rack values are taken from the global snitch object (which, in turn, goes to gossiper, system keyspace and its internal non-updateable cache for that). This patch prepares the ground for providing the dc/rack externally via argument. By now it's just an argument with empty strings, but next patches will populate it with real values (spoiler: in 99% it's storage service that calls this method and each call will know where to get it from for sure) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:41:09 +03:00
Pavel Emelyanov	360c4f8608	dht: Carry dc-rack over boot_strapper and range_streamer Both classes may populate (temporary clones of) token metadata object with endpoint:tokens pairs for the endpoint they work with. Next patches will require that endpoint comes with the dc/rack info. This patch makes sure dht classes have the necessary information at hand (for now it's just empty pair of strings). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:37:02 +03:00
Pavel Emelyanov	c7a3fed225	storage_service: Make replacement info a real struct This is to extend it in one of the next patches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-08-26 09:36:16 +03:00
Kamil Braun	e350e37605	service/raft: raft_group0: implement upgrade procedure A listener is created inside `raft_group0` for acting when the SUPPORTS_RAFT feature is enabled. The listener is established after the node enters NORMAL status (in `raft_group0::finish_setup_after_join()`, called at the end of `storage_service::join_cluster()`). The listener starts the `upgrade_to_group0` procedure. The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled (see earlier commit which implemented this logic) - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations (only those for now). The devil lies in the details, and the implementation is ugly compared to this nice description; for example there are many retry loops for handling intermittent network failures. Read the code. `leave_group0` and `remove_group0` were adjusted to handle the upgrade procedure being run correctly; if necessary, they will wait for the procedure to finish. If the upgrade procedure gets stuck (and it may, since it requires all nodes to be available to contact them to correctly establish a single group 0 raft cluster); or if a running cluster permanently loses a majority of nodes, causing group 0 unavailability; the cluster admin is not left without help. We introduce a recovery mode, which allows the admin to completely get rid of traces of existing group 0 and restart the upgrade procedure - which will establish a new group 0. This works even in clusters that never upgraded but were bootstrapped using group 0 from scratch. 
To do that, the admin does the following on every node: - writes 'recovery' under 'group0_upgrade_state' key in `system.scylla_local` table, - truncates the `system.discovery` table, - truncates the `system.group0_history` table, - deletes group 0 ID and group 0 server ID from `system.scylla_local` (the keys are `raft_group0_id` and `raft_server_id`); then the admin performs a rolling restart of their cluster. The nodes restart in a "group 0 recovery mode", which simply means that the nodes won't try to perform any group 0 operations. Then the admin calls `removenode` to remove the nodes that are down. Finally, the admin removes the `group0_upgrade_state` key from `system.scylla_local`, rolling-restarts the cluster, and the cluster should establish group 0 anew. Note that this recovery procedure will have to be extended when new stuff is added to group 0 - like topology change state. Indeed, observe that a minority of nodes aren't able to receive committed entries from a leader, so they may end up in inconsistent group 0 states. It wouldn't be safe to simply create group 0 on those nodes without first ensuring that they have the same state from which group 0 will start. Right now the state only consists of schema tables, and the upgrade procedure makes sure to synchronize them, so even if the nodes started in inconsistent schema states, group 0 will correctly be established. (TODO: create a tracking issue? something needs to remind us of this whenever we extend group 0 with new stuff...)	2022-08-23 13:51:01 +02:00
Petr Gusev	aa88d58539	raft, use max_command_size to satisfy commitlog limit Commitlog imposes a limit on the size of mutations and throws an exception if it's exceeded. In case of schema changes before raft this exception was delivered to the client. Now it happens while saving the raft command in io_fiber in persistence->store_log_entries and what the client gets is just a timeout exception, which doesn't say much about the cause of the problem. This patch introduces an explicit command size limit and provides a clear error message in this case.	2022-08-23 12:09:32 +04:00
Kamil Braun	2ba1fb0490	service/raft: raft_group0: extract `tracker` from `persistent_discovery::run` Extract it to a top-level abstraction, write comments. It will be reused in the following commit.	2022-08-19 19:15:19 +02:00
Kamil Braun	f7e02a7de9	service/raft: raft_group0: introduce local loggers for group 0 and upgrade	2022-08-19 19:15:19 +02:00
Kamil Braun	ac5f4248a9	service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb During the upgrade procedure nodes will want to obtain the upgrade state of other nodes to proceed. This is what the new verb is for.	2022-08-19 19:15:19 +02:00
Kamil Braun	43687be1f1	service/raft: raft_group0_client: prepare for upgrade procedure Now, whether a 'group 0 operation' (today it means schema change) is performed using the old or new methods, doesn't depend on the local RAFT feature being enabled, but on the state of the upgrade procedure. In this commit the state of the upgrade is always `use_pre_raft_procedures` because the upgrade procedure is not implemented yet. But stay tuned. The upgrade procedure will need certain guarantees: at some point it switches from `use_pre_raft_procedures` to `synchronize` state. During `synchronize` schema changes must be disabled, so the procedure can ensure that schema is in sync across the entire cluster before establishing group 0. Thus, when the switch happens, no schema change can be in progress. To handle all this weirdness we introduce `_upgrade_lock` and `get_group0_upgrade_state` which takes this lock whenever it returns `use_pre_raft_procedures`. Creating a `group0_guard` - which happens at the start of every group 0 operation - will take this lock, and the lock holder shall be stored inside the guard (note: the holder only holds the lock if `use_pre_raft_procedures` was returned, no need to hold it for other cases). Because `group0_guard` is held for the entire duration of a group 0 operation, and because the upgrade procedure will also have to take this lock whenever it wants to change the upgrade state (it's an rwlock), this ensures that no group 0 operation that uses the old ways is happening when we change the state. We also implement `wait_until_group0_upgraded` using a condition variable. It will be used by certain methods during upgrade (later commits; stay tuned). Some additional comments were written.	2022-08-19 19:15:19 +02:00
Kamil Braun	7e56251aea	service/raft: introduce `group0_upgrade_state` Define an enum class, `group0_upgrade_state`, describing the state of the upgrade procedure (implemented in later commits). Provide IDL definitions for (de)serialization. The node will have its current upgrade state stored on disk in `system.scylla_local` under the `group0_upgrade_state` key. If the key is not present we assume `use_pre_raft_procedures` (meaning we haven't started upgrading yet or we're at the beginning of upgrade). Introduce `system_keyspace` accessor methods for storing and retrieving the on-disk state.	2022-08-19 19:15:19 +02:00
Kamil Braun	b52429f724	Merge 'raft: relax some error severity' from Gleb Natapov Dtest fails if it sees unknown errors in the logs. This series reduces severity of some errors (since they are actually expected during shutdown) and removes some others that duplicate already existing errors that dtest knows how to deal with. Also fix one case of unhandled exception in schema management code. * 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla: raft: getting abort_requested_exception exception from a sm::apply is not a critical error schema_registry: fix abandoned feature warning service: raft: silence rpc::closed_errors in raft_rpc	2022-08-18 12:16:44 +02:00
Avi Kivity	8070cdbbf9	storage_proxy: mutate_counters_on_leader: coroutinize Simplify ahead of refactoring for consistent effective_replication_map.	2022-08-14 17:36:58 +03:00
Avi Kivity	6e330d98d2	storage_proxy: mutate_counters: coroutinize Simplify ahead of refactoring for consistent effective_replication_map. This is probably a pessimization of the error case, but the error case will be terrible in any case unless we resultify it.	2022-08-14 17:28:46 +03:00
Avi Kivity	105b066ff7	storage_proxy: mutate_counters: reorganize error handling Move the error handling function where it's used so the code is more straightforward. Due to some std::move()s later, we must still capture the schema early.	2022-08-14 17:13:22 +03:00
Benny Halevy	d295d8e280	everywhere: define locator::host_id as a strong tagged_uuid type So it can be distinguished from other uuid-based identifiers in the system. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11276	2022-08-12 06:01:44 +03:00
Botond Dénes	69aea59d97	Merge 'storage_proxy: use consistent topology, prepare for fencing' from Avi Kivity Replication is a mix of several inputs: tokens and token->node mappings (topology), the replication strategy, replication strategy parameters. These are all captured in effective_replication_map. However, if we use effective_replication_map:s captured at different times in a single query, then different uses may see different inputs to effective_replication_map. This series protects against that by capturing an effective_replication_map just once in a query, and then using it. Furthermore, the captured effective_replication_map is held until the query completes, so topology code can know when a topology is no longer in use (although this isn't exploited in this series). Only the simple read and write paths are covered. Counters and paxos are left for later. I don't think the series fixes any bugs - as far as I could tell everything was happening in the same continuation. But this series ensures it. Closes #11259 * github.com:scylladb/scylladb: storage_proxy: use consistent topology storage_proxy: use consistent replication map on read path storage_proxy: use consistent replication map on write path storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map consistency_level: accept effective_replication_map as parameter, rather than keyspace consistency_level: be more const when using replication_strategy	2022-08-12 06:00:30 +03:00
Avi Kivity	a2c4f5aa1a	storage_proxy: use consistent topology Derive the topology from captured and stable effective_replication_map instead of getting a fresh topology from storage_proxy, since the fresh topology may be inconsistent with the running query. digest_read_resolver did not capture an effective_replication_map, so that is added.	2022-08-11 17:58:42 +03:00
Avi Kivity	883518697b	storage_proxy: use consistent replication map on read path Capture a replication map just once in abstract_read_executor::_effective_replication_map_ptr. Although it isn't used yet, it serves to keep a reference count on topology (for fencing), and some accesses to topology within reads still remain, which can be converted to use the member in a later patch.	2022-08-11 17:58:42 +03:00
Avi Kivity	01a614fb4d	storage_proxy: use consistent replication map on write path Capture a replication map just once in abstract_write_handler::_effective_replication_map_ptr and use it in all write handlers. A few accesses to get the topology still remain, they will be fixed up in a later patch.	2022-08-11 17:58:42 +03:00
Avi Kivity	f1b0e3d58e	storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map Allow callers to use consistent effective_replication_map:s across calls by letting the caller select the object to use.	2022-08-11 17:58:42 +03:00
Avi Kivity	46bd0b1e62	consistency_level: accept effective_replication_map as parameter, rather than keyspace A keyspace is a mutable object that can change from time to time. An effective_replication_map captures the state of a keyspace at a point in time and can therefore be consistent (with care from the caller). Change consistency_level's functions to accept an effective_replication_map. This allows the caller to ensure that separate calls use the same information and are consistent with each other. Current callers are likely correct since they are called from one continuation, but it's better to be sure.	2022-08-11 17:58:42 +03:00
Gleb Natapov	eed8e19813	service: raft: silence rpc::closed_errors in raft_rpc Before the patch if an RPC connection was established already then the close error was reported by the RPC layer and then duplicated by raft_rpc layer. If a connection cannot be established because the remote node is already dead RPC does not report the error since we decided that in that case gossiper and failure detector messages can be used to detect the dead node case and there is no reason to pollute the logs with recurring errors. This aligns raft behaviour with what we already have in storage_proxy that does not report closed errors as well.	2022-08-11 15:11:21 +03:00
Petr Gusev	4bc6611829	raft read_barrier, retry over intermittent rpc failures If the leader was unavailable during read_barrier, closed_error occurs, which was not handled in any way and eventually reached the client. This patch adds retries in this case. Fix: scylladb#11262 Refs: #11278 Closes #11263	2022-08-11 13:31:19 +03:00
Amnon Heiman	5ac20ac861	Reduce the number of per-scheduling group metrics This patch reduces the number of metrics ScyllaDB generates. Motivation: The combination of per-shard with per-scheduling group generates a lot of metrics. When combined with histograms, which require many metrics, the problem becomes even bigger. The two tools we are going to use: 1. Replace per-shard histograms with summaries 2. Do not report unused metrics. The storage_proxy stats holds information for the API and the metrics layer. We replaced timed_rate_moving_average_and_histogram and time_estimated_histogram with the unified timed_rate_moving_average_summary_and_histogram, which gives us an option to report per-shard summaries instead of histograms. All the counters, histograms, and summaries were marked as skip_when_empty. The API was modified to use timed_rate_moving_average_summary_and_histogram. Closes #11173	2022-08-11 13:31:19 +03:00
Avi Kivity	e9cbc9ee85	Merge 'Add support for empty replica pages' from Botond Dénes Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones. The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by `3131cbea62`, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones. The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set. Upgrade sanity test was conducted as following: * Created cluster of 3 nodes with RF=3 with master version * Wrote small dataset of 1000 rows. * Deleted prefix of 980 rows. * Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100` * Also did some manual queries via `cqlsh` with smaller page size and tracing on. 
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`. * Confirmed there are no errors or read-repairs. Perf regression test: ``` build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60 ``` Before: ``` median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors) median absolute deviation: 973.40 maximum: 135511.63 minimum: 104978.74 ``` After: ``` median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors) median absolute deviation: 2979.13 maximum: 134538.13 minimum: 114688.07 ``` Diff: +~200 instruction/op. Fixes: https://github.com/scylladb/scylla/issues/7689 Fixes: https://github.com/scylladb/scylla/issues/3914 Fixes: https://github.com/scylladb/scylla/issues/7933 Refs: https://github.com/scylladb/scylla/issues/3672 Closes #11053 * github.com:scylladb/scylladb: test/cql-pytest: add test for query tombstone page limit query-result-writer: stop when tombstone-limit is reached service/pager: prepare for empty pages service/storage_proxy: set smallest continue pos as query's continue pos service/storage_proxy: propagate last position on digest reads query: result_merger::get() don't reset last-pos on short-reads and last pages query: add tombstone-limit to read-command service/storage_proxy: add get_tombstone_limit() query: add tombstone_limit type db/config: add config item for query tombstone limit gms: add cluster feature for empty replica pages tree: don't use query::read_command's IDL constructor	2022-08-10 13:38:06 +03:00
Botond Dénes	8066dbc635	service/pager: prepare for empty pages The pager currently assumes that an empty page means the query is exhausted. Lift this assumption, as we will soon have empty short pages. Also, paging with filtering needs to use the replica-provided last-position when the page is empty.	2022-08-10 06:03:38 +03:00
Botond Dénes	6a7dedfe34	service/storage_proxy: set smallest continue pos as query's continue pos We expect each replica to stop at exactly the same position when the digests match. Soon however, if replicas have a lot of tombstones, some may stop earlier than the others. As long as all digests match, this is fine but we need to make sure we continue from the smallest such position on the next page.	2022-08-10 06:03:38 +03:00
Botond Dénes	2656968db2	service/storage_proxy: propagate last position on digest reads We want to transmit the last position as determined by the replica on both result and digest reads. Result reads already do that via the query::result, but digest reads don't yet as they don't return the full query::result structure, just the digest field from it. Add the last position to the digest read's return value and collect these in the digest resolver, along with the returned digests.	2022-08-10 06:03:37 +03:00
Botond Dénes	d1d53f1b84	query: add tombstone-limit to read-command Propagate the tombstone-limit from coordinator to replicas, to make sure all of them are using the same limit.	2022-08-10 06:01:47 +03:00
Avi Kivity	be44fd63f9	Merge 'Make get_range_addresses async and hold effective_replication_map_ptr around it' from Benny Halevy This series converts the synchronous `effective_replication_map::get_range_addresses` to async by calling the replication strategy async entry point with the same name, as its callers are already async or can be made so easily. To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map, let the callers of this patch hold a effective_replication_map per keyspace and pass it down to the (now asynchronous) functions that use it (making affected storage_service methods static where possible if they no longer depend on the storage_service instance). Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate that is true for local_strategy and everywhere_replication_strategy, and is false otherwise. With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow. 
Refs #11005 Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping) Closes #11009 * github.com:scylladb/scylladb: effective_replication_map: make get_range_addresses asynchronous range_streamer: add_ranges and friends: get erm as param storage_service: get_new_source_ranges: get erm as param storage_service: get_changed_ranges_for_leaving: get erm as param storage_service: get_ranges_for_endpoint: get erm as param repair: use get_non_local_strategy_keyspaces_erms database: add get_non_local_strategy_keyspaces_erms database: add get_non_local_strategy_keyspaces storage_service: coroutinize update_pending_ranges effective_replication_map: add get_replication_strategy effective_replication_map: get_range_addresses: use the precalculated replication_map abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies abstract_replication_strategy: reindent utils: sequenced_set: expose set and `contains` method abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set utils: sequenced_set: templatize VectorType utils: sanitize sequenced_set utils: sequenced_set: delete mutable get_vector method	2022-08-09 13:25:53 +03:00
Asias He	12ab2c3d8d	storage_service: Prevent removed node to restart and join the cluster 1) Start node1,2,3 2) Stop node3 3) Run nodetool removenode $host_id_of_node3 4) Restart node3 Step 4 is wrong and not allowed. If it happens it will bring back node3 to the cluster. This patch adds a check during node restart to detect such operation error and reject the restart. With this patch, we would see the following in step 4. ``` init - Startup failed: std::runtime_error (The node 127.0.0.3 with host_id fa7e500a-8617-4de4-8efd-a0e177218ee8 is removed from the cluster. Can not restart the removed node to join the cluster again!) ``` Refs #11217 Closes #11244	2022-08-09 12:46:21 +03:00
Raphael S. Carvalho	337390d374	forward_service: execute_on_this_shard: avoid reallocation and copy avoid about log2(256)=8 reallocations when pushing partition ranges to be fetched. additionally, also avoid copying range into ranges container. current_range will not contain the last range, after moved, but will still be engaged by the end of the loop, allowing next iteration to happen as expected. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11242	2022-08-09 09:08:53 +02:00
Botond Dénes	1b669cefed	service/storage_proxy: add get_tombstone_limit() To be used by coordinator side code to determine the correct tombstone limit to pass to read-command (tombstone limit field added in the next commit). When this limit is non-zero, the replica will start cutting pages after the tombstone limit is surpassed. This getter works similarly to `get_max_result_size()`: if the cluster feature for empty replica pages is set, it will return the value configured via db::config::query_tombstone_limit. System queries always use a limit of 0 (unlimited tombstones).	2022-08-09 10:00:40 +03:00
Benny Halevy	91ab8ee1c3	effective_replication_map: make get_range_addresses asynchronous So it may yield, preventing reactor stalls as seen in #11005. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	9b2af3f542	range_streamer: add_ranges and friends: get erm as param Rather than getting it in the callee, let the caller (e.g. storage_service) hold the erm and pass it down to potentially multiple async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	194b9af8d6	storage_service: get_new_source_ranges: get erm as param Rather than getting it in the callee, let the caller hold the erm and pass it down to potentially multiple async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	b50c79eab3	storage_service: get_changed_ranges_for_leaving: get erm as param Rather than getting it in the callee, let the caller hold the erm and pass it down to potentially multiple async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	a5d7ade237	storage_service: get_ranges_for_endpoint: get erm as param Let its caller pass the effective_replication_map ptr so we can get it at the top level and keep it alive (and coherent) through multiple asynchronous calls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	db5c5ca59e	database: add get_non_local_strategy_keyspaces_erms To be used for getting a coherent set of all keyspaces with non-local replication strategy and their respective effective_replication_map. As an example, use it in this patch in storage_service::update_pending_ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	7ee6048255	database: add get_non_local_strategy_keyspaces For node operations, we currently call get_non_system_keyspaces but really want to work on all keyspaces that have non-local replication strategy as they are replicated on other nodes. Reflect that in the replica::database function name. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
Benny Halevy	d8484b3ee6	storage_service: coroutinize update_pending_ranges Before we make a change in getting the keyspaces and their effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 17:31:01 +03:00
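
The "Add support for empty replica pages" merge (e9cbc9ee85) listed above describes a replica cutting a page once a tombstone limit is reached, even if the page is still empty, and reporting the last processed position so the next page can make forward progress. The following is a minimal standalone sketch of that idea, not ScyllaDB code: `entry`, `page`, `read_page` and `count_pages` are hypothetical names invented for illustration, positions are modelled as plain vector indexes, and a limit of 0 is treated as "unlimited" as the `get_tombstone_limit()` commit describes for system queries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one item in a partition: a live row or a tombstone.
struct entry {
    bool live;
};

// Hypothetical page result returned by a replica.
struct page {
    std::size_t live_rows = 0;  // rows that count towards the page size
    std::size_t next_pos = 0;   // index of the first unprocessed entry
    bool short_read = false;    // true if the page was cut by the tombstone limit
};

// Read one page starting at `start`. The page ends when it holds `page_size`
// live rows, when the partition is exhausted, or when `tombstone_limit`
// tombstones were processed (0 means no tombstone limit).
page read_page(const std::vector<entry>& partition, std::size_t start,
               std::size_t page_size, std::size_t tombstone_limit) {
    assert(page_size > 0);
    page p;
    p.next_pos = start;
    std::size_t tombstones = 0;
    while (p.next_pos < partition.size() && p.live_rows < page_size) {
        if (partition[p.next_pos].live) {
            ++p.live_rows;
        } else {
            ++tombstones;
        }
        ++p.next_pos;
        if (tombstone_limit != 0 && tombstones >= tombstone_limit) {
            p.short_read = true;  // cut the page, possibly with zero live rows
            break;
        }
    }
    return p;
}

// Count how many pages are needed to read the whole partition, always
// resuming from the position reported by the previous page.
std::size_t count_pages(const std::vector<entry>& partition,
                        std::size_t page_size, std::size_t tombstone_limit) {
    std::size_t pages = 0;
    std::size_t pos = 0;
    while (pos < partition.size()) {
        pos = read_page(partition, pos, page_size, tombstone_limit).next_pos;
        ++pages;
    }
    return pages;
}
```

With a partition shaped like the one in the upgrade sanity test (a deleted prefix of 980 rows followed by 20 live rows), a page size of 100 and a limit of 10, the first page in this model comes back empty and short, and the read resumes from position 10 instead of scanning the whole tombstone span in one go.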


2971 Commits