The routine used for getting service level information already
operates on the service level name, but the same information is
also parsed once more from a row of an internal table.
This parsing is redundant, so it's hereby removed.
This commit implements the following overload prevention heuristic:
if the admission queue becomes full, a timer is armed for 50ms.
If any of the ongoing requests finishes, the timer is disarmed,
but if that doesn't happen, the server goes into shedding mode,
which means that it reads new requests from the socket and immediately
drops them until one of the ongoing requests finishes.
This heuristic is not recommended for OLAP workloads,
so it is applied only if the session declared itself as
interactive (via service level's workload_type parameter).
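As a rough sketch in seastar terms (all names below are illustrative,
not the actual implementation):

    #include <seastar/core/timer.hh>
    #include <chrono>

    // Illustrative sketch of the shedding heuristic; names are made up.
    class admission_control {
        seastar::timer<> _shed_timer;
        bool _shedding = false;
    public:
        admission_control() {
            // Timer firing means no request finished within 50ms of the
            // queue filling up: start dropping newly read requests.
            _shed_timer.set_callback([this] { _shedding = true; });
        }
        void on_admission_queue_full() {
            if (!_shed_timer.armed()) {
                _shed_timer.arm(std::chrono::milliseconds(50));
            }
        }
        void on_request_finished() {
            // Any finished request disarms the timer and stops shedding.
            _shed_timer.cancel();
            _shedding = false;
        }
        bool should_shed() const { return _shedding; }
    };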
The workload type is currently one of three values:
- unspecified
- interactive
- batch
By defining the workload type, the service level makes it easier
for other components to decide what to do in overload scenarios.
E.g. if the workload is interactive, requests can be shed earlier,
while if it's batch (or unspecified), shedding does not take place.
Conversely, batch workloads could accept long full scan operations.
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.
This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for some time, off-strategy compaction
will be triggered for that table automatically.
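Conceptually the trigger works like the sketch below (hypothetical
names; the timeout value is illustrative):

    #include <seastar/core/timer.hh>
    #include <seastar/core/lowres_clock.hh>
    #include <chrono>

    // Hypothetical sketch: each repair request pushes the deadline
    // forward; the callback runs only after a quiet period.
    struct offstrategy_trigger {
        seastar::timer<seastar::lowres_clock> _timer;
        std::chrono::seconds _idle_timeout{600}; // illustrative value
        offstrategy_trigger() {
            _timer.set_callback([] { /* start off-strategy compaction */ });
        }
        // Called on every repair request touching the table.
        void on_repair_activity() {
            _timer.rearm(seastar::lowres_clock::now() + _idle_timeout);
        }
    };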
Fixes #8677
Closes #8678
This warning triggers when a range-based for loop ("for (auto x : range)")
causes non-trivial copies, prompting the developer to iterate by
reference instead. A few minor violations in the test suite are corrected.
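A reduced illustration of the pattern and the fix:

    #include <string>
    #include <vector>

    void visit(const std::vector<std::string>& names) {
        for (auto name : names) {        // copies every string: warning
            (void)name;
        }
        for (const auto& name : names) { // iterate by reference: fixed
            (void)name;
        }
    }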
Closes #8699
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations are not enabled by default, we should
respect the flag and use streaming to sync data if the flag is false.
Fixes #8700
Closes #8701
* github.com:scylladb/scylla:
storage_service: Add removenode_add_ranges helper
storage_service: Respect --enable-repair-based-node-ops flag during removenode
Fixes #8270
If we have an allocation pattern where we leave large parts of segments
"wasted" (typically because the segment has empty space, but cannot hold
the mutation being added), we can have a disk usage that is below
threshold, yet still get a disk _footprint_ that is over the limit,
causing new segment allocation to stall.
We need to take a few things into account:
1.) Include wasted space in the threshold check. Whether or not the
    disk space is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No
    point in waiting for the timer task.
3.) Adjust the thresholds a bit. Depending on sizes, we should probably
    start flushing once we've used up enough space to be in the last
    available segment, so a new one is hopefully available by the time
    we hit the limit.
Also fix an edge case (for tests) where we have too few segments to
have an active one (i.e. we need to flush everything).
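In rough terms (the names below are illustrative, not the actual
commitlog fields), the adjusted check becomes:

    #include <cstdint>

    // Illustrative sketch of points 1.) and 3.) above.
    bool should_flush(uint64_t active_size, uint64_t wasted_size,
                      uint64_t max_disk_size, uint64_t segment_size) {
        // 1.) slack (wasted) bytes count as used space
        uint64_t footprint = active_size + wasted_size;
        // 3.) start flushing once we're allocating in the last available
        //     segment, so a fresh one is ready before we hit the limit
        return footprint >= max_disk_size - segment_size;
    }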
Closes #8695
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
Refs #8270
Since segment allocation looks at the actual disk footprint, not the
active size, the threshold check in the timer task should include slack
space so we don't mistake sparse usage for space left.
Refs #8270
Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of "halfway through". This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle under heavy usage.
Currently the pending (memtables) flushes stats are adjusted back
only on success, therefore they will "leak" on error; use a
.then_wrapped clause to always update the stats.
Note that _commitlog->discard_completed_segments is still called
only on success, and so is returning the previous_flush future.
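A minimal sketch of the pattern, with illustrative names:

    #include <seastar/core/future.hh>
    #include <utility>

    // The then_wrapped continuation runs whether the flush succeeded or
    // failed, so the pending counter cannot leak on error.
    seastar::future<> count_flush(seastar::future<> flush_fut, long& pending) {
        ++pending;
        return flush_fut.then_wrapped([&pending] (seastar::future<> f) {
            --pending;
            return std::move(f); // propagate the value or the exception
        });
    }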
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>
Currently, advance_and_wait() allocates a new gate
which might fail. Rather than returning this failure
as an exceptional future - which would require its callers
to handle that failure - keep the function noexcept and
let an exception from make_lw_shared<gate>() terminate the program.
This makes the function "fail-free" to its callers,
in particular, when called from the table::stop() path where
we can't do much about these failures and we require close/stop
functions to always succeed.
The alternative of making the allocation of a new gate optional
and recovering from it in start() is possible, but was deemed not
worth it as it would add complexity and cost to start(), which is
called on the common, hot path.
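The pattern, reduced to a sketch (not the actual table code):

    #include <seastar/core/gate.hh>
    #include <seastar/core/shared_ptr.hh>

    // Because the function is noexcept, a std::bad_alloc thrown by
    // make_lw_shared calls std::terminate() instead of surfacing to
    // callers on the must-not-fail stop() path.
    seastar::lw_shared_ptr<seastar::gate> fresh_gate() noexcept {
        return seastar::make_lw_shared<seastar::gate>();
    }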
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occasionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.
Ref #8613
Closes #8692
In https://github.com/scylladb/scylla/issues/8609,
table::stop(), which is called from database::drop_column_family,
is expected to wait on outstanding flushes by calling
_memtable->request_flush(), but the memtable_list is considered
empty() at this point as it has a single empty memtable,
so request_flush() returns a ready future without waiting
on outstanding flushes.
Fix that by replacing the call to request_flush() with flush(),
which either returns the _flush_coalescing future that resolves
when the memtable is sealed, if available, or goes through the
get_flush_permit() and _dirty_memory_manager->flush_one song and
dance even though the memtable is empty(), as the latter waits on
pending flushes.
Fixes #8609
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524143438.1056014-1-bhalevy@scylladb.com>
The algorithm used in `get_address_ranges` and `get_range_addresses`
calls `calculate_natural_endpoints` in a loop; the loop iterates over
all tokens in the token ring. If the complexity of a particular
implementation of `calculate_natural_endpoints` is large - say `θ(n)`,
where `n` is the number of tokens - this results in an `θ(n^2)`
algorithm (or worse). This is the case for the `Everywhere` replication strategy.
For small clusters this doesn't matter that much, but if `n` is, say, `20*255`,
this may result in huge reactor stalls, as observed in practice.
We avoid these stalls by inserting tactical yields. We hope that
some day someone actually implements a subquadratic algorithm here.
The commit also adds a comment on
`abstract_replication_strategy::calculate_natural_endpoints` explaining
that the interface does not give a complexity guarantee (at this point);
the different implementations have different complexities.
For example, `Everywhere` implementation always iterates over all tokens
in the token ring, so it has `θ(n)` worst and best case complexity.
On the other hand, `NetworkTopologyStrategy` implementation usually
finishes after visiting a small part of the token ring (specifically,
as soon as it finds a token for each node in the ring) and performs
a constant number of operations for each visited token on average,
but theoretically its worst case complexity is actually `O(n + k^2)`,
where `n` is the number of all tokens and `k` is the number of endpoints
(the `k^2` appears since for each endpoint we must perform finds and
inserts on `unordered_set` of size `O(k)`; `unordered_set` operations
have `O(1)` average complexity but `O(size of the set)` worst case
complexity).
Therefore it's not easy to put any complexity guarantee in the interface
at this point. Instead, we say that:
- some implementations may yield - if their complexities force us to do so
- but in general, there is no guarantee that an implementation will
  yield - e.g. the `Everywhere` implementation does not yield.
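For illustration, a tactical yield in a coroutine-style loop could look
like this (a sketch, not the patch's exact code):

    #include <seastar/core/coroutine.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <vector>

    // Yield between iterations of a long token-ring walk so the reactor
    // can run other tasks, avoiding multi-second stalls on large rings.
    seastar::future<> walk_ring(std::vector<int64_t> tokens) {
        for (auto token : tokens) {
            (void)token; // per-token work, e.g. calculate_natural_endpoints
            co_await seastar::coroutine::maybe_yield();
        }
    }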
Fixes #8555.
Closes #8647
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
Fixes #8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations are not enabled by default, we should
respect the flag and use streaming to sync data if the flag is false.
Fixes #8700
config_file.cc instantiates std::istream& std::operator>>(std::istream&,
std::unordered_map<seastar::sstring, seastar::sstring>&), but that
instantiation is ignored since config_file_impl.hh specializes
that signature. -Winstantiation-after-specialization warns about it,
so re-enable it now that the code base is clean.
Also remove the matching "extern template" declaration, which has no
definition any more.
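A reduced illustration of what the warning catches:

    // The primary template and a declared explicit specialization...
    template <typename T> T parse(const char* s) { return T{}; }
    template <> int parse<int>(const char* s);

    // ...make this later explicit instantiation a no-op, which clang
    // reports under -Winstantiation-after-specialization.
    template int parse<int>(const char* s);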
Closes #8696
* seastar 28dddd2683...f0f28d07e1 (4):
> httpd: allow handler to not read an empty content
Fixes #8691.
> compat: source_location: implement if no std or experimental are available
> compat: source_location: declare using in seastar::compat namespace
> perftune.py: fix a bug in mlx4 IRQs names matching pattern
To keep our cql-pytest tests "correct", we should strive for them to pass on
Cassandra - unless they are testing a Scylla-only feature or a deliberate
difference between Scylla and Cassandra - in which case they should be marked
"scylla-only" and cause such tests to be skipped when running on Cassandra.
The following few small patches fix a few cases where our tests were failing on
Cassandra. In one case this even found a bug in the test (a trivial Python
mistake, but still).
Closes #8694
* github.com:scylladb/scylla:
test/cql-pytest: fix python mistake in an xfailing test
test/cql-pytest: mark some tests with scylla-only
test/cql-pytest: clean up test_create_large_static_cells_and_rows
The cql3 layer manipulates lists as `std::vector`s (of
`managed_bytes_opt`). Since lists can be arbitrarily large, let's use
chunked vectors there to prevent potentially large contiguous
allocations.
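For reference, `utils::chunked_vector` keeps elements in bounded
chunks, so even a huge list avoids one big allocation (illustrative
usage):

    #include "utils/chunked_vector.hh"

    void fill() {
        utils::chunked_vector<int> elements;
        for (int i = 0; i < 1000000; ++i) {
            elements.push_back(i); // grows chunk by chunk, never one
                                   // multi-megabyte contiguous buffer
        }
    }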
Closes #8668
* github.com:scylladb/scylla:
cql3: change the internal type of tuples::in_value from std::vector to chunked_vector
cql3: change the internal type of lists::value from std::vector to chunked_vector
cql3: in multi_item_terminal, return the vector of items by value
This series fixes a minor validation issue with service level
timeouts - negative values were not checked. This bug is benign
because negative timeouts act just like a 0s timeout, but the
original series claimed to validate against negative values, so it's
hereby fixed.
More importantly, this series follows up by enabling cql-pytest to
run service level tests and provides a first batch of them, including
a missing test case for negative timeouts.
The idea is similar to what we already have in the alternator test
suite - authentication is unconditionally enabled, which doesn't
affect any existing tests, but at the same time allows writing test
cases which rely on authentication - e.g. service levels.
Closes #8645
* github.com:scylladb/scylla:
cql-pytest: introduce service level test suite
cql-pytest: add enabling authentication by default
qos: fix validating service level timeouts for negative values
Now the RPC module has some basic testing coverage to
make sure the RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).
The test suite currently consists of the following
test cases:
* Loading server instance with configuration from a snapshot.
* Loading server instance with configuration from a log.
* Configuration changes (remove + add node).
* Leader elections don't lead to RPC configuration changes.
* Voter <-> learner node transitions also don't change RPC
configuration.
* Reverting uncommitted configuration changes updates
RPC configuration accordingly (two cases: revert to
snapshot config or committed state from the log).
A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.
Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.
This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.
It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.
* manmanson/raft-rpc-tests-v9-alt3:
raft: add tests for RPC module
test: add CHECK_EVENTUALLY_EQUAL utility macro
raft: replication_test: reset test rpc network between test runs
raft: replication_test: extract tickers initialization into a separate func
raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
raft: replication_test: introduce `test_server` aggregate struct
raft: replication_test: support voter<->learner configuration changes
raft: remove duplicate `create_command` function from `replication_test`
raft: avoid 'using' statements in raft testing helpers header
This patchset adds the missing features noted by the patchset
that introduced coverage.py, namely:
* The ability to run a test through `coverage.py`, automating the entire
process of setting up the environment, running the test and generating
the report. This is possible with the new `--run` command line
argument. It supports either generating a report immediately after
running the provided test or just doing the running part, allowing the
user to generate the report after having run all the tests they wanted
to.
* A tweakable verbosity level.
It is also possible to specify a subset of the profiling data as input
for the report.
The documentation was also completed, with examples for all the
intended use-cases.
With these changes, `coverage.py` is considered mature, the remaining
rough edges being located in other scripts (`tests.py` and
`configure.py`).
It is now possible to generate a coverage report for any test desired.
Also on: https://github.com/denesb/scylla.git
coverage-py-missing-features/v1
Botond Dénes (5):
scripts/coverage.py: allow specifying the input files to generate the
report from
scripts/coverage.py: add capability of running a test directly
scripts/coverage.py: add --verbose parameter
scripts/coverage.py: document intended uses-cases
HACKING.md: redirect to ./coverage.py for more details
scripts/coverage.py | 143 +++++++++++++++++++++++++++++++++++++++-----
HACKING.md | 19 +-----
2 files changed, 129 insertions(+), 33 deletions(-)
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneeded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
This is a follow-up change to #8512.
Let's add the aio conf file during the Scylla installation process and
make sure we also remove this file when uninstalling Scylla.
As per Avi Kivity's suggestion, let's set the aio value as static
configuration, and make it large enough to work with 500 cpus.
Closes #8650
Currently, var-lib-scylla.mount may fail because it can start before
the MDRAID volume is initialized.
We might be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to
wait for the device to become available, but the systemd manual says
it automatically configures the dependency for a mount unit when we
specify the filesystem path by "absolute path of a device node".
So we need to replace What=UUID=<uuid> with
What=/dev/disk/by-uuid/<uuid>.
Fixes #8279
Closes #8681
The xfailing test cassandra_tests/validation/entities/collections_test.py::
testSelectionOfEmptyCollections had a Python mistake (using {} instead
of set() for an empty set), which resulted in its failure when run
against Cassandra. After this patch it passes on Cassandra and fails on
Scylla - as expected (this is why it is marked xfail).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Tests which are known to test a Scylla-only feature (such as CDC)
or to rely on a known and deliberate difference between Scylla and Cassandra
should be marked "scylla-only", so they are skipped when running
the tests against Cassandra (test/cql-pytest/run-cassandra) instead
of reporting errors.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_create_large_static_cells_and_rows had its own
implementation of "nodetool flush" using Scylla's REST API.
Now that we have a nodetool.flush() function for general use in
cql-pytest, let's use it and save a bit of duplication.
Another benefit is that now this test can be run (and pass) against
Cassandra.
To allow this test to run on Cassandra, I had to remove a
"USING TIMEOUT" which wasn't necessary for this test, and is
not a feature supported by Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, every failure to pull the configuration was
reported as a warning. However, this is confusing for users for two
reasons:
1. It pollutes the logs if the configuration is polled, which is
   Scylla's mode of operation. Such a line is logged on every failed
   iteration.
2. It confuses users because even though the level is warning, it
   logs an exception and the log message contains the word "failed".
We see it a lot during QA runs and in customer questions from the
field.
Point 2 is only solvable by reducing the verbosity of the logged
information, which will make debugging harder.
Point 1 is addressed here in the following manner: first, the
one-shot configuration pull function no longer handles the exception
itself. This is OK because it is harmless to fail once or twice in a
row in configuration pulling, like in every other query; the caller
is the one responsible for handling the exception and logging the
information. Second, the polling loop captures the exceptions thrown
from the configuration pulling function and only reports an error
with the latest exception if the polling has failed in consecutive
iterations over the last 90 seconds. This value was chosen because it
is about the empirical worst-case time it takes a node to notice that
one of the other nodes in the cluster is down (hence not querying it).
It is not important for the user or for us to be notified of
temporary glitches in availability (through this error, at least),
and since we are eventually consistent it is OK that some nodes will
catch up with the configuration later than others.
We also set a threshold after which, if the configuration still
couldn't be retrieved, the logging level is bumped to ERROR.
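A hypothetical sketch of the resulting polling loop (names and details
are illustrative):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/lowres_clock.hh>
    #include <seastar/util/log.hh>
    #include <chrono>
    #include <functional>
    #include <optional>

    seastar::future<> poll_config(std::function<seastar::future<>()> pull_once) {
        static seastar::logger lg("config_poller");
        std::optional<seastar::lowres_clock::time_point> failing_since;
        for (;;) {
            try {
                co_await pull_once(); // may throw; does not log itself
                failing_since.reset();
            } catch (...) {
                auto now = seastar::lowres_clock::now();
                if (!failing_since) {
                    failing_since = now; // first failure: stay quiet
                } else if (now - *failing_since > std::chrono::seconds(90)) {
                    lg.error("failed to pull config: {}", std::current_exception());
                }
            }
            co_await seastar::sleep(std::chrono::seconds(1));
        }
    }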
Closes #8574
This helper function is an artifact of forward-porting service levels,
and it wouldn't even compile when used because of mismatched
function declarations. It's not used anywhere in the open-source code,
so it's removed to avoid future merge conflicts.
Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>
Since serialize_value needs to copy the values to a bigger buffer anyway,
there is no point in copying the argument higher in the call chain.
This patch eliminates some pointless copies, for example in
alternator/executor.cc
Closes #8688
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.
For example, a node sends 20 gossip messages in the 20 seconds before
it dies. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as
down, because a heartbeat update is received just before the threshold
to mark a node down - around 20 seconds by default - is reached.
As a result, this node will not be marked as down for 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node-down detection time
in normal cases.
In this patch, a new failure detector is implemented.
- Direct detection
The existing failure detector can get gossip heartbeat updates
indirectly. For example:
Node A can talk to Node B
Node B can talk to Node C
Node A cannot talk to Node C, due to network issues
Node A will not mark Node C as down because Node A can get the
heartbeat of Node C from Node B indirectly.
This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.
This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been met.
Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.
- Parallel detection
The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16-node cluster, where each node has 16 shards, each shard
will handle only one peer node.
A gossip message will be sent to peer nodes every 2 seconds. The extra
echo message traffic produced compared to the old failure detector is
negligible.
- Deterministic detection
Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo messages before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.
- Compatible
This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.
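A hypothetical per-shard probe loop for one owned peer (the real
implementation differs in details):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/lowres_clock.hh>
    #include <chrono>
    #include <functional>

    // Send an echo every 2 seconds; mark the peer down once the time
    // since the last successful echo exceeds the configured timeout.
    seastar::future<> probe_peer(std::function<seastar::future<>()> send_echo,
                                 std::function<void(bool)> mark_alive,
                                 std::chrono::milliseconds timeout) {
        auto last_ok = seastar::lowres_clock::now();
        for (;;) {
            try {
                co_await send_echo();
                last_ok = seastar::lowres_clock::now();
                mark_alive(true);
            } catch (...) {
                if (seastar::lowres_clock::now() - last_ok > timeout) {
                    mark_alive(false); // deterministic, no phi estimation
                }
            }
            co_await seastar::sleep(std::chrono::seconds(2));
        }
    }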
Fixes #8488
Closes #8036
The -Wunused-private-field warning was squelched when we switched to
clang to make the change easier. But it is a useful warning, so
re-enable it.
It found a serious bug (#8682) and a few minor instances of waste.
Closes #8683
* github.com:scylladb/scylla:
build: enable -Wunused-private-field warning
test: drop unused fields
table: drop unused field database_sstable_write_monitor::_compaction_manager
streaming: drop unused fields
sstables: mx reader: drop unused _column_value_length field
sstables: index_consumer: drop unused max_quantity field
compaction: resharding_compaction: drop unused _shard field
compaction: compaction_read_monitor: drop unused _compaction_manager field
raft: raft_services: drop unused _gossiper field
repair: drop unused _nr_peer_nodes field
redis: drop unused fields _storage_proxy and _requests_blocked_memory
mutation_rebuilder: drop unused field _remaining_limit
db: data_listeners: remove unused field _db
cql3: insert_json_statement: note bug with unused _if_not_exists
cql3: authorized_prepared_statement_cache: drop unused field _logger
auth: service_level_resource_view: drop unused field _resource
Yet another patch preventing potentially large allocations.
Currently, collection_mutation{_view,}_description linearize each collection
value during deserialization. It's not unthinkable that a user adds a
large element to a list or a map, so let's avoid that.
This patch removes the dependency on linearizing_input_stream, which does not
provide a way to read fragmented subbuffers, and replaces it with a new
helper, which does. (Extending linearizing_input_stream is not viable without
rewriting it completely).
Only linearization of collection values is corrected in this patch.
Collection keys are still linearized. Storing them in managed_bytes is likely
to be more harmful than helpful, because large map keys are extremely unlikely,
and UUIDs, which are used as keys in lists, do not fit into
managed_bytes's small value optimization, so this would incur an
extra allocation for every
list element.
Note: this patch leaves utils/linearizing_input_stream.hh unused.
Refs: #8120
Closes #8690
Yet another patch aiming to prevent potentially large allocations.
abstract_type::hash somehow evaded the anti-linearization patches until now.
Fix that.
Note that decimals and varints are still linearized, but we leave it be,
under the assumption that nobody inserts 128KiB-large varints into a database.
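The gist of the change, as a sketch with illustrative names:

    // Feed each fragment of the value to the hasher in order, instead of
    // first copying all fragments into one contiguous buffer.
    template <typename Hasher, typename FragmentRange>
    void hash_fragmented(Hasher& h, const FragmentRange& fragments) {
        for (auto frag : fragments) { // frag is a bytes_view-like span
            h.update(reinterpret_cast<const char*>(frag.data()), frag.size());
        }
    }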
Refs: #8120
Closes #8689