scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 14:15:46 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	6a154305d7	gossiper: Remove db::config reference from gossiper Also const-ify the db::config reference argument and std::move the gossip_config argument while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 18:34:55 +03:00
Pavel Emelyanov	0c24087007	gossiper: Keep live-updateable options on gossiper These options need to have updateable_value<> instance referencing them from gossiper itself. The updateable_value<> is shard-aware in the sense that it should be constructed on correct shard. This patch does this -- the db::config reference is carried all the way down to the gossiper constructor, then each instance gets its shard-local construction of the updateable_value<>s. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 18:34:55 +03:00
Pavel Emelyanov	271ceb57b9	gossiper: Keep immutable options on gossip_config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-02 18:34:55 +03:00
Amnon Heiman	c764f0d0f8	gms/gossiper.cc: Add gauge for live and unreachable nodes this patch adds two gauges: scylla_gossip_live - how many live nodes the gossiper sees scylla_gossip_unreachable - how many nodes the gossiper tries to connect to but cannot. Both metrics are reported once per node (i.e., per node, not per shard) it gives visibility to how a specific node sees the cluster. For example, a split-brain 6 nodes cluster (3 and 3). Each node would report that it sees 2 nodes, but the monitoring system would see that there are, in fact, 6 nodes. Example of two nodes cluster, both running: `` scylla_gossip_live{shard="0"} 1.000000 scylla_gossip_unreachable{shard="0"} 0.000000 `` Example of two nodes cluster, one is down: `` scylla_gossip_live{shard="0"} 0.000000 scylla_gossip_unreachable{shard="0"} 1.000000 `` Fixes #10102 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes #10103 [avi: remove whitespace change and correct spelling]	2022-02-20 19:42:58 +02:00
Michael Livshin	d370558279	add "ME_SSTABLE" cluster feature Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0b1447c702	add "sstable_format" config Initialize it to "md" until ME format support is complete (i.e. storing originating host id in sstable stats metadata is implemented), so at present there is no observable change by default. Also declare "enable_sstables_md_format" unused -- the idea, going forward, being that only "sstable_format" controls the written sstable file format and that no more per-format enablement config options shall be added. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Pavel Solodovnikov	dce3159156	gms: gossiper: coroutinize `wait_for_gossip` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:34:52 +03:00
Pavel Solodovnikov	ab41151a41	gms: gossiper: coroutinize `advertise_token_removed` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:33:32 +03:00
Pavel Solodovnikov	4416070f56	gms: gossiper: coroutinize `advertise_removing` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:33:13 +03:00
Pavel Solodovnikov	e9f5da9507	gms: gossiper: don't wrap `convict` calls into `seastar::async` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:32:14 +03:00
Pavel Solodovnikov	e26829e202	gms: gossiper: coroutinize `handle_major_state_change` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	705a759891	gms: gossiper: coroutinize `handle_shutdown_msg` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	9ce0e2efa3	gms: gossiper: coroutinize `mark_as_shutdown` and `convict` Since these two functions call each other, convert to coroutines and eliminate the dependency on `seastar::async` for both of them at the same time. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	c584a9cc1f	gms: gossiper: remove comment about requiring thread context in `mark_alive` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	ee30d0a385	gms: gossiper: don't use `seastar::async` in `mark_alive` Since `real_mark_alive` does not require `seastar::async` now, we can eliminate the wrapping async call, as well. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	529f4d0f98	gms: gossiper: coroutinize `do_on_change_notifications` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	37066039df	gms: gossiper: coroutinize `do_before_change_notifications` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	231d8a3ad4	gms: gossiper: coroutinize `real_mark_alive` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:21 +03:00
Pavel Solodovnikov	c929f23b8d	gms: gossiper: coroutinize `mark_dead` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-02-05 10:15:20 +03:00
Michał Sala	3789a4d02b	gms: add PARALLELIZED_AGGREGATION feature This new feature will be used to determine whether the whole cluster is ready to parallelize execution of aggregation queries.	2022-02-01 21:14:41 +01:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Pavel Solodovnikov	5dcfb94d5a	gms: i_endpoint_state_change_subscriber: make callbacks to return futures Coroutinize a few simple callbacks in the process. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	b958e85c54	utils: atomic_vector: rename `for_each` to `thread_for_each` To emphasize that the function requires `seastar::thread` context to function properly. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	445876a125	gms: gossiper: coroutinize `start_gossiping` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	04b3172e6b	gms: gossiper: coroutinize `force_remove_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	a01c900d66	gms: gossiper: coroutinize `do_status_check` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Pavel Solodovnikov	42ff01eee2	gms: gossiper: coroutinize `remove_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Avi Kivity	d01e1a774b	Merge 'Build performance: do not include the entire <seastar/net/ip.hh>' from Nadav Har'El The header file <seastar/net/ip.hh> is a large collection of unrelated stuff, and according to ClangBuildAnalyzer, takes 2 seconds to compile for every source file that included it - and unfortunately virtually all Scylla source files included it - through either "types.hh" or "gms/inet_address.hh". That's 2300 CPU seconds wasted. In this two-patch series we completely eliminate the inclusion of <seastar/net/ip.hh> from Scylla. We still need the ipv4_address, ipv6_address types (e.g., gms/inet_address.hh uses it to hold a node's IP address) so those were split (in a Seastar patch that is already in) from ip.hh into separate small header files that we can include. This patch reduces the entire build time (of build/dev/scylla) by 4% - reducing almost 10 CPU minutes (!) from the build. Closes #9875 github.com:scylladb/scylla: build performance: do not include <seastar/net/ip.hh> build performance: speed up inclusion of <gm/inet_address.hh>	2022-01-05 17:55:07 +02:00
Raphael S. Carvalho	426450dc04	treewide: remove useless include of database.hh Wrote a script based on cpp-include to find places that needlessly included database.hh, which is expensive to process during build time. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220104204359.168895-1-raphaelsc@scylladb.com>	2022-01-05 10:15:19 +02:00
Nadav Har'El	3fbbad7d60	build performance: speed up inclusion of <gm/inet_address.hh> The header file <gms/inet_address.hh> is included, directly or indirectly, from 291 source files in Scylla. It is hard to reduce this number because Scylla relies heavily on IP addresses as keys to different things. So it is important that this header file be fast to include. Unfortunately it wasn't... ClangBuildAnalyzer measurements showed that each inclusion of this header file added a whopping 2 seconds (in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU minutes - were spent just on this header file. It was actually worse because the build also spent additional time on template instantiation (more on this below). So in this patch we: 1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid including it in one place that doesn't need it. This is just cosmetic, and doesn't significantly speed up the build. 2. Move the to_sstring() implementation from the .hh to the .cc. This saves a lot of time on template instantiations - previously every source file instantiated this to_sstring(), which was slow (that "format" thing is slow). 3. Do not include <seastar/net/ip.hh> which is a huge file including half the world. All we need from it is the type "ipv4_address", so instead include just the new <seastar/net/ipv4_address.hh>. This change brings most of the performance improvement. Some source files forgot to include various Seastar header files because the includes-everything ip.hh did it - so we need to add these missing includes in this patch. After this patch, ClangBuildAnalyzer reports that the cost of inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326 seconds. Additionally the format<inet_address> template instantiation 291 times - about half a second each - is also gone. All in all, this patch should reduce around 10 CPU minutes from the build. Refs #1 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-01-04 21:07:23 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of making the database consistency to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that is used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user. The old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Pavel Solodovnikov	904de0a094	gms: introduce two gossip features for raft-based cluster management The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features. These features provide a way to organize the automatic switch to raft-based cluster management. The scheme is as follows: 1. Every new node declares support for raft-based cluster ops. 2. At the moment, no nodes in the cluster can actually use raft for cluster management, until the `SUPPORTS` feature is enabled (i.e. understood by every node in the cluster). 3. After the first `SUPPORTS` feature is enabled, the nodes can declare support for the second, `USES*` feature, which means that the node can actually switch to use raft-based cluster ops. The scheme ensures that even if some nodes are down while transitioning to new bootstrap mechanism, they can easily switch to the new procedure, not risking to disrupt the cluster. The features are not actually wired to anything yet, providing a framework for the integration with `raft_group0` code, which is subject for a follow-up series. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>	2021-12-30 11:05:45 +02:00
Avi Kivity	4a323772c1	Merge 'Use the same page size limit in reverse queries as in forward reads' from Piotr Jastrzębski The default for get_unlimited_query_max_result_size() is 100MB (adjustable through config), whereas query::result_memory_limiter::maximum_result_size is 1MB (hard coded, should be enough for everybody) This limit is then used by the replica to decide when to break pages and, in case of reversed clustering order reads, when to fail the read when accumulated data crosses the threshold. The latter behavior stems from the fact that reversed reads had to accumulate all the data (read in forward order) before they can reverse it and return the result. Reverse reads thus need a higher limit so that they have a higher chance of succeeding. Most readers are now supporting reading in reverse natively, and only reversing wrappers (make_reversing_reader()) inserted on top of ka/la sstable readers need to accumulate all the data. In other cases, we could break pages sooner. This should lead to better stability (less memory usage) and performance (lower page build latency, higher read concurrency due to less memory footprint). Tests: unit(dev) Closes #9815 * github.com:scylladb/scylla: storage_proxy: Send page_size in the read_command gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature result_memory_accounter: use new max_result_size::get_page_size in check_local_limit max_result_size: Add page_size field	2021-12-29 15:04:01 +02:00
Avi Kivity	49a603af39	gossip: fix lowres_clock::duration assumption The variable diff is assigned a type of std::chrono::milliseconds but later used to store the difference between two lowres_clock::time_point samples. This works now because the two types are the same, but fails if lowres_clock::duration changes. Remove the assumption by using lowres_clock::duration.	2021-12-28 21:13:59 +02:00
Piotr Jastrzebski	02d5997377	gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature This new feature will be used to determine whether the whole cluster is ready to use additional page_size field in max_result_size. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-12-28 16:38:02 +01:00
Nadav Har'El	9ae98dbe92	Merge 'Reduce boot time for dtest setup' from Asias He This patch helps to speed up node boot up for test setups like dtest. Nadav reported ``` With Asias's two patches to Scylla, and my patch to enable it in dtest: Boot time of 5 nodes is now down to 9 seconds! Remember we started this exercise with 214 seconds? :-) ``` Closes #9808 * github.com:scylladb/scylla: storage_service: Recheck tokens before throw in storage_service::bootstrap gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero	2021-12-16 13:44:42 +02:00
Avi Kivity	87917d2536	Merge "gms: gossiper: coroutinize a few small functions" from Pavel S " Start converting small functions in gossiper code from using `seastar::thread` context to coroutines. For now, the changes are quite trivial. Later, larger code fragments will be converted to eliminate uses of `seastar::async` function calls. Moving the code to coroutines makes the code a bit more readable and also makes it immediately evident that a given function is async just by looking at the signature (for example, for void-returning functions, a coroutine will return `future<>` instead of `void` in case of a seastar::thread-using function). Tests: unit(dev) " * 'coro_gossip_v1' of https://github.com/ManManson/scylla: gms: gossiper: coroutinize `maybe_enable_features` gms: gossiper: coroutinize `wait_alive` gms: gossiper: coroutinize `add_saved_endpoint` gms: gossiper: coroutinize `evict_from_membership`	2021-12-15 16:02:18 +02:00
Asias He	78d0cc4ab5	gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero Setting skip_wait_for_gossip_to_settle == 0 means do not wait for gossip to settle at all. This is not respected in gossiper::wait_for_range_setup and in gossiper::wait_for_gossip for initial sleeps. Setting skip_wait_for_gossip_to_settle to zero is not allowed in a production cluster anyway; it is mostly used by tests like dtest to reduce the cluster boot up time. Respect the skip_wait_for_gossip_to_settle zero setting and avoid any sleep and wait completely.	2021-12-15 19:40:43 +08:00
Pavel Solodovnikov	47533bca65	gms: gossiper: coroutinize `maybe_enable_features` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:39:48 +03:00
Pavel Solodovnikov	3993c6a9fb	gms: gossiper: coroutinize `wait_alive` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:30:32 +03:00
Pavel Solodovnikov	a6ff04dd24	gms: gossiper: coroutinize `add_saved_endpoint` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:23:35 +03:00
Pavel Solodovnikov	23dd8b66c5	gms: gossiper: coroutinize `evict_from_membership` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-11 09:15:03 +03:00
Juliusz Stasiewicz	351f142791	cdc/check_and_repair_cdc_streams: ignore LEFT endpoints When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla would throw. This behavior is fixed so that LEFT nodes are simply ignored. Fixes #9771 Closes #9778	2021-12-10 15:28:14 +01:00
Pavel Solodovnikov	777985b64d	gms: gossiper: maybe_enable_features() should enable features in seastar::async context Since `gms::feature::enable()` requires `seastar::async` context to be present. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	5b5fbb4b33	gms: feature_service: expose registered features map This will be used for re-enabling previously enabled cluster features, which will be introduced in later patches. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	a2f5ad432f	gms: feature_service: persist enabled features Save each feature enabled through the feature_service instance in the `system.scylla_local` under the 'enabled_features' key. The features would be persisted only if the underlying query context used by `db::system_keyspace` is initialized. Since `system.scylla_local` table is essentially a string->string map, use an ad-hoc method for serializing enabled features set: the same as used in gossiper for translating supported features set via gossip. The entry should be saved before we enable the feature so that crash-after-enable is safe. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	e891f874df	gms: move `to_feature_set()` function from gossiper to feature_service This utility will also be used for de-serialization of persisted enabled features, which will be introduced in a later patch. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Benny Halevy	55967a8597	batchlog_manager: endpoint_filter: move to gossiper There's nothing in this function that actually requires the batchlog manager instance. It uses a random number engine that's moved along with it to class gossiper. This resolves a circular dependency between the batchlog_manager and storage_proxy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Pavel Emelyanov	9fccf7f3af	gossiper: Guard background processing with gate When shut down, the gossiper may still have some messages being processed in the background. This brings two problems. First, the gossiper itself is about to disappear soon and messages might step on the freed instance (however, this one is not real now, gossiper is not freed for real, just ::stop() is called). Second, message processing may notify other subsystems which, in turn, do not expect this after the gossiper is shut down. The common solution to this is to run background code through a gate that gets closed at some point, the ::shutdown() in the gossiper case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 10:25:03 +03:00


725 Commits