scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Botond Dénes	996e2f8048	Merge 'Handle serialized_action trigger exceptions' from Benny Halevy " which is currently unhandled from multiple call sites, leading to the following warning as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/1094/artifact/logs-all.release.2/1643794928169_materialized_views_test.py%3A%3ATestInterruptBuildProcess%3A%3Atest_interrupt_build_process_and_resharding_half_to_max_test/node2.log ``` Scylla version 5.0.dev-0.20220201.a026b4ef4 with build-id cebf6dca8edd8df843a07e0f01a1573f1d0a6dfc starting ... WARN 2022-02-02 09:31:56,616 [shard 2] seastar - Exceptional future ignored: seastar::sleep_aborted (Sleep is aborted), backtrace: 0x463b65e 0x463bb50 0x463be58 0x426c165 0x230c744 0x42adad4 0x42aeea7 0x42cdb55 0x4281a2a /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libpthread.so.0+0x9298 /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libc.so.6+0x100352 -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false> >(seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> ``` Decoded: ``` void seastar::backtrace(seastar::current_backtrace_tasklocal()::$_3&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59 (inlined by) seastar::current_backtrace_tasklocal() at 
./build/release/seastar/./seastar/src/util/backtrace.cc:86 seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:137 seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:170 seastar::report_failed_future(std::__exception_ptr::exception_ptr const&) at ./build/release/seastar/./seastar/src/core/future.cc:210 (inlined by) seastar::report_failed_future(seastar::future_state_base::any&&) at ./build/release/seastar/./seastar/src/core/future.cc:218 seastar::future_state_base::any::check_failure() at ././seastar/include/seastar/core/future.hh:567 (inlined by) seastar::future_state::clear() at ././seastar/include/seastar/core/future.hh:609 (inlined by) ~future_state at ././seastar/include/seastar/core/future.hh:614 (inlined by) ~future at ././seastar/include/seastar/core/scheduling.hh:43 (inlined by) void seastar::futurize >::satisfy_with_result_of::then_wrapped_nrvo, seastar::future::finally_body >(seastar::future::finally_body&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}::operator()(seastar::internal::promise_base_with_type, seastar::internal::promise_base_with_type&&, seastar::future_state::finally_body&&::monostate>) const::{lambda()#1}>(seastar::internal::promise_base_with_type, seastar::future::finally_body&&) at ././seastar/include/seastar/core/future.hh:2120 (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1667 (inlined by) seastar::continuation, seastar::future::finally_body, seastar::future::then_wrapped_nrvo, serialized_action::trigger(bool)::{lambda()#2}>(serialized_action::trigger(bool)::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767 seastar::reactor::run_tasks(seastar::reactor::task_queue&) at 
./build/release/seastar/./seastar/src/core/reactor.cc:2344 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2754 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2923 operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4128 (inlined by) void std::__invoke_impl(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:61 (inlined by) std::enable_if, void>::type std::__invoke_r(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:111 (inlined by) std::_Function_handler::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:291 std::function::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:560 (inlined by) seastar::posix_thread::start_routine(void) at ./build/release/seastar/./seastar/src/core/posix.cc:60 ``` This series handles exception handling to serialized actions triggers that don't handle exceptions. Test: unit(dev) " tag 'handle-serialized_action-trigger-exception-v1' of https://github.com/bhalevy/scylla: migration_manager: passive_announce(version): handle exception view_builder: do_build_step: handle unexpected exceptions storage_service: no need to include utils/serialized_action.hh	2022-02-03 10:17:59 +02:00
Calle Wilund	1e66043412	commitlog: Fix double clearing of _segment_allocating shared_future. Fixes #10020 Previous fix `445e1d3` tried to close one double invocation, but added another, since it failed to ensure all potential nullings of the opt shared_future happened before a new allocator could reset it. This simplifies the code by making clearing the shared_future a pre-requisite for resolving its contents (as read by waiters). Also removes any need for try-catch etc. Closes #10024	2022-02-02 23:26:17 +02:00
Benny Halevy	b56b10a4bb	view_builder: do_build_step: handle unexpected exceptions Exceptions are handled by do_build_step in principle, yet if an unhandled exception escapes handling (e.g. get_units(_sem, 1) fails on a broken semaphore) we should warn about it, since the _build_step.trigger() calls do not handle exceptions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-02 14:54:19 +02:00
Pavel Emelyanov	a026b4ef49	config: Add option to disable config updates via CQL The system.config table allows changing config parameters, but this change doesn't survive restarts and is considered to be dangerous (sometimes). Add an option to disable the table updates. The option is LiveUpdate and can be set to false via CQL too (once). fixes #9976 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220201121114.32503-1-xemul@scylladb.com>	2022-02-01 14:30:47 +02:00
Calle Wilund	445e1d3e41	commitlog: Ensure we never have more than one new_segment call at a time Refs #9896 Found by @eliransin. The call to new_segment was wrapped in with_timeout. This means that if the primary caller timed out, we would leave new_segment calls running, but potentially issue new ones for the next caller. This could lead to the reserve segment queue being read simultaneously, which is not what we want. Change to always use the shared_future wait, for all callers, and clear it only on result (exception or segment) Closes #10001	2022-01-31 16:50:22 +02:00
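The "one allocation in flight, shared by all callers" idea from the commit above can be sketched as follows. This is a minimal asyncio analogy, not the commitlog code: `SegmentAllocator`, `_new_segment` and `get_segment` are hypothetical names, and an asyncio task stands in for the optional shared_future.

```python
import asyncio

class SegmentAllocator:
    """Hypothetical stand-in for the segment manager (illustration only)."""

    def __init__(self):
        self._pending = None   # shared in-flight allocation, if any
        self.calls = 0         # how many times _new_segment() actually ran

    async def _new_segment(self):
        self.calls += 1
        await asyncio.sleep(0.05)          # pretend to create a segment file
        return f"segment-{self.calls}"

    async def _allocate(self):
        try:
            return await self._new_segment()
        finally:
            # Cleared only once a result (segment or exception) exists.
            self._pending = None

    async def get_segment(self, timeout):
        if self._pending is None:
            self._pending = asyncio.ensure_future(self._allocate())
        # shield(): a caller that times out does not cancel the shared
        # allocation, and the next caller joins it instead of starting another.
        return await asyncio.wait_for(asyncio.shield(self._pending), timeout)

async def demo():
    alloc = SegmentAllocator()

    async def impatient():
        try:
            return await alloc.get_segment(timeout=0.01)
        except asyncio.TimeoutError:
            return "timeout"

    first, second = await asyncio.gather(impatient(),
                                         alloc.get_segment(timeout=1.0))
    return first, second, alloc.calls
```

Running `asyncio.run(demo())` shows the first caller timing out while the second caller still receives the segment from the single allocation that was started.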
Tomasz Grabiec	ba6c02b38a	Merge "Clear old entries from group 0 history when performing schema changes" from Kamil When performing a change through group 0 (which right now means schema changes), clear entries from group 0 history table which are older than one week. This is done by including an appropriate range tombstone in the group 0 history table mutation. * kbr/g0-history-gc-v2: idl: group0_state_machine: fix license blurb test: unit test for clearing old entries in group0 history service: migration_manager: clear old entries from group 0 history when announcing	2022-01-26 16:12:40 +01:00
Gleb Natapov	579dcf187a	raft: allow an option to persist commit index Raft does not need to persist the commit index since a restarted node will either learn it from an append message from a leader or (if entire cluster is restarted and hence there is no leader) new leader will figure it out after contacting a quorum. But some users may want to be able to bring their local state machine to a state as up-to-date as it was before restart as soon as possible without any external communication. For them this patch introduces new persistence API that allows saving and restoring last seen committed index. Message-Id: <YfFD53oS2j1My0p/@scylladb.com>	2022-01-26 14:06:39 +01:00
Calle Wilund	43f51e9639	commitlog: Ensure we don't run continuation (task switch) with queues modified Fixes #9955 In #9348 we handled the problem of failing to delete segment files on disk, and the need to recompute disk footprint to keep data flow consistent across intermittent failures. However, because _reserve_segments and _recycled_segments are queues, we have to empty them to inspect the contents. One would think it is ok for these queues to be empty for a while, whilst we do some recalculating, including disk listing -> continuation switching. But then one (i.e. I) misses the fact that these queues use the pop_eventually mechanism, which does _not_ handle a scenario where we push something into an empty queue, thus triggering the future that resumes a waiting task, but then pop the element immediately, before the waiting task is run. In fact, _iff_ one does this, not only will things break, they will in fact start creating undefined behaviour, because the underlying std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push operations -> we will pop an empty queue, immediately making it non-empty, but using undefined memory (with luck null/zeroes). Strictly speaking, seastar::queue::pop_eventually should be fixed to handle the scenario, but nonetheless we can fix the usage here as well, by simply copying the objects and doing the calculation "in background" while we potentially start popping the queue again. Closes #9966	2022-01-26 13:51:01 +02:00
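The push-then-pop hazard described above can be modelled in a few lines. This is an asyncio analogy under stated assumptions, not Seastar code: an `asyncio.Event` stands in for the future that resumes the waiter, and Python's `deque` raises `IndexError` where the unchecked C++ queue would read undefined memory.

```python
import asyncio
from collections import deque

async def demo():
    items = deque()
    not_empty = asyncio.Event()
    outcome = []

    async def waiter():
        # Resumed some time after set() is called, not at that instant.
        await not_empty.wait()
        try:
            outcome.append(items.popleft())
        except IndexError:
            outcome.append("popped an empty queue")

    task = asyncio.ensure_future(waiter())
    await asyncio.sleep(0)      # let the waiter block on the event

    items.append("segment")     # push into an empty queue wakes the waiter...
    not_empty.set()
    items.popleft()             # ...but the element is gone before it runs
    await task
    return outcome
```

The waiter is woken on the promise that an element exists, yet by the time it runs the queue is empty again, which is the window the commit closes by working on a copy of the queue contents.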
Kamil Braun	e9083433a8	service: migration_manager: clear old entries from group 0 history when announcing When performing a change through group 0 (which right now only covers schema changes), clear entries from group 0 history table which are older than one week. This is done by including an appropriate range tombstone in the group 0 history table mutation.	2022-01-25 13:11:14 +01:00
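A rough sketch of the one-week retention rule described above. It is illustrative only: the real change attaches a range tombstone to the history mutation, whereas this models the effect as a filter, and `gc_history` plus the exact boundary handling are assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(weeks=1)

def gc_history(entries, now):
    """Keep only (timestamp, description) entries not older than one week.

    Models what a range tombstone covering everything before `cutoff`
    would hide from readers of the history table.
    """
    cutoff = now - RETENTION
    return [(ts, desc) for ts, desc in entries if ts >= cutoff]

now = datetime(2022, 1, 25, tzinfo=timezone.utc)
entries = [
    (now - timedelta(days=10), "old schema change"),
    (now - timedelta(days=1), "recent schema change"),
]
kept = gc_history(entries, now)
```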
Kamil Braun	044e05b0d9	service: migration_manager: `announce`: take a description parameter The description parameter is used for the group 0 history mutation. The default is empty, in which case the mutation will leave the description column as `null`. I filled the parameter in some easy places as an example and left the rest for a follow-up. This is how it looks now in a fresh cluster with a single statement performed by the user: cqlsh> select * from system.group0_history ; key \| state_id \| description ---------+--------------------------------------+------------------------------------------------------ history \| 9ec29cac-7547-11ec-cfd6-77bb9e31c952 \| CQL DDL statement history \| 9beb2526-7547-11ec-7b3e-3b198c757ef2 \| null history \| 9be937b6-7547-11ec-3b19-97e88bd1ca6f \| null history \| 9be784ca-7547-11ec-f297-f40f0073038e \| null history \| 9be52e14-7547-11ec-f7c5-af15a1a2de8c \| null history \| 9be335dc-7547-11ec-0b6d-f9798d005fb0 \| null history \| 9be160c2-7547-11ec-e0ea-29f4272345de \| null history \| 9bdf300e-7547-11ec-3d3f-e577a2e31ffd \| null history \| 9bdd2ea8-7547-11ec-c25d-8e297b77380e \| null history \| 9bdb925a-7547-11ec-d754-aa2cc394a22c \| null history \| 9bd8d830-7547-11ec-1550-5fd155e6cd86 \| null history \| 9bd36666-7547-11ec-230c-8702bc785cb9 \| Add new columns to system_distributed.service_levels history \| 9bd0a156-7547-11ec-a834-85eac94fd3b8 \| Create system_distributed(_everywhere) tables history \| 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a \| Create system_distributed_everywhere keyspace history \| 9bcec89a-7547-11ec-e1b4-34e0010b4183 \| Create system_distributed keyspace	2022-01-24 15:20:37 +01:00
Kamil Braun	fad72daeb4	db: system_keyspace: introduce `system.group0_history` table This table will contain a history of all group 0 changes applied through Raft. With each change is an associated unique ID, which also identifies the state of all group 0 tables (including schema tables) after this change is applied, assuming that all such changes are serialized through Raft (they will be eventually). We will use these state IDs to check if a given change is still valid at the moment it is applied (in `group0_state_machine::apply`), i.e. that there wasn't a concurrent change that happened between creating this change and applying it (which may invalidate it).	2022-01-24 15:20:37 +01:00
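The state-ID validity check described above resembles optimistic concurrency control and can be sketched as follows. This is a hypothetical model, not `group0_state_machine::apply`: the class, the dictionary layout and the use of `uuid.uuid1()` for state IDs are all assumptions made for illustration.

```python
import uuid

class Group0State:
    """Toy model: a change applies only if it was prepared against the
    state that is still current when it is applied."""

    def __init__(self):
        self.state_id = uuid.uuid1()
        self.history = [self.state_id]

    def prepare(self, description):
        # Record which state this change was built on top of.
        return {
            "prev_state_id": self.state_id,
            "new_state_id": uuid.uuid1(),
            "description": description,
        }

    def apply(self, change):
        if change["prev_state_id"] != self.state_id:
            return False            # a concurrent change got in first
        self.state_id = change["new_state_id"]
        self.history.append(self.state_id)
        return True

state = Group0State()
a = state.prepare("CREATE TABLE a")
b = state.prepare("CREATE TABLE b")   # prepared concurrently with a
applied_a = state.apply(a)
applied_b = state.apply(b)            # rejected: state moved on after a
```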
Kamil Braun	a664ac7ba5	treewide: require `group0_guard` when performing schema changes `announce` now takes a `group0_guard` by value. `group0_guard` can only be obtained through `migration_manager::start_group0_operation` and moved, it cannot be constructed outside `migration_manager`. The guard will be a method of ensuring linearizability for group 0 operations.	2022-01-24 15:20:35 +01:00
Kamil Braun	86762a1dd9	service: migration_manager: rename `schema_read_barrier` to `start_group0_operation` 1. Generalize the name so it mentions group 0, which schema will be a strict subset of. 2. Remove the fact that it performs a "read barrier" from the name. The function will be used in general to ensure linearizability of group0 operations - both reads and writes. "Read barrier" is Raft-specific terminology, so it can be thought of as an implementation detail.	2022-01-24 15:12:50 +01:00
Kamil Braun	283ac7fefe	treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions The functions which prepare schema change mutations (such as `prepare_new_column_family_announcement`) would use internally generated timestamps for these mutations. When schema changes are managed by group 0 we want to ensure that timestamps of mutations applied through Raft are monotonic. We will generate these timestamps at call sites and pass them into the `prepare_` functions. This commit prepares the APIs.	2022-01-24 15:12:50 +01:00
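One possible way a call site could produce monotonic, current-time-based timestamps is sketched below. This is an assumption about the general technique, not the code in this series; `MonotonicTimestamps` and its microsecond resolution are hypothetical.

```python
import time

class MonotonicTimestamps:
    """Hand out microsecond timestamps that never go backwards or repeat."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._last = 0

    def next(self):
        now = int(self._clock() * 1_000_000)
        # Follow the wall clock, but step past the last value if the
        # clock stalled or jumped backwards.
        self._last = max(now, self._last + 1)
        return self._last

frozen = MonotonicTimestamps(clock=lambda: 1.0)   # clock that never advances
stamps = [frozen.next() for _ in range(3)]
```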
Kamil Braun	0af5f74871	db: system_distributed_keyspace: use current time when creating mutations in `start()` When creating or updating internal distributed tables in `system_distributed_keyspace::start()`, hardcoded timestamps were used. There were two reasons for this: - to protect against issue #2129, where nodes would start without synchronizing schema with the existing cluster, creating the tables again, which would override any manual user changes to these tables. The solution was to use small timestamps (like api::min_timestamp) - the user-created schema mutations would always 'win' (because when they were created, they used current time). - to eliminate unnecessary schema sync. If two nodes created these tables concurrently with different timestamps, the schemas would formally be different and would need to merge. This could happen during upgrades when we upgraded from a version which doesn't have these tables or doesn't have some columns. The #2129 workaround is no longer necessary: when nodes start they always have to sync schema with existing nodes; we also don't allow bootstrapping nodes in parallel. The second problem would happen during parallel bootstrap, which we don't allow, or during parallel upgrade. The procedure we recommend is rolling upgrade - where nodes are upgraded one by one. In this case only one node is going to create/update the tables; following upgraded nodes will sync schema first and notice they don't need to do anything. So if procedures are followed correctly, the workaround is not needed. If someone doesn't follow the procedures and upgrades nodes in parallel, these additional schema synchronizations are not a big cost, so the workaround doesn't give us much in this case either. When schema changes are performed by Raft group 0, certain constraints are placed on the timestamps used for mutations. For this we'll need to be able to use timestamps which are generated based on current time.	2022-01-24 15:12:49 +01:00
Nadav Har'El	7cb6250c40	Merge 'snapshot_ctl: true_snapshots_size: fix space accounting' from Benny Halevy This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size https://github.com/scylladb/scylla/issues/9897 https://github.com/scylladb/scylla/issues/9898 And adds a couple of unit tests to test the snapshot_ctl functionality. Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug) Closes #9899 * github.com:scylladb/scylla: table: get_snapshot_details: count allocated_size snapshot_ctl: cleanup true_snapshots_size snpashot_ctl: true_snapshots_size: do not map_reduce across all shards	2022-01-19 11:57:15 +02:00
Benny Halevy	5440739e1b	snapshot_ctl: cleanup true_snapshots_size Cleanup indentation and s/local_total/total/ as it is Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-19 07:50:53 +02:00
Benny Halevy	5db3cbe1e4	snpashot_ctl: true_snapshots_size: do not map_reduce across all shards snapshot_ctl uses map_reduce over all database shards, each counting the size of the snapshots directory, which is shared, not per-shard. So the total live size returned by it is multiplied by the number of shards. Add a unit test to test that. Fixes #9897 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-19 07:50:53 +02:00
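The overcounting described above comes down to simple arithmetic, sketched here with hypothetical function names (this is not the snapshot_ctl code): summing a shared directory's size once per shard multiplies it by the shard count.

```python
def true_snapshots_size_buggy(shard_count, shared_dir_size):
    # Every shard measures the same shared directory, then the results
    # are summed, as a map_reduce over all shards would do.
    return sum(shared_dir_size for _ in range(shard_count))

def true_snapshots_size_fixed(shard_count, shared_dir_size):
    # The directory is shared, so measure it once regardless of shards.
    return shared_dir_size

buggy = true_snapshots_size_buggy(shard_count=8, shared_dir_size=100)
fixed = true_snapshots_size_fixed(shard_count=8, shared_dir_size=100)
```

With 8 shards and 100 bytes of snapshots, the first version reports 800 bytes and the second reports 100.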
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Avi Kivity	985403ab99	view: convert build_progress_virtual_reader to flat_mutation_reader_v2 build_progress_virtual_reader is a virtual reader that trims off the last clustering key column from an underlying base table. It is here converted to flat_mutation_reader_v2. Because range_tombstone_change uses position_in_partition, not clustering_key_prefix, we need a new adjust_ckey() overload. Note the transformation is likely incorrect. When trimming the last clustering key column, an inclusive bound should change to exclusive. However, the original code did not do this, so we don't fix it here. It's immaterial anyway since the base table doesn't include range tombstones. Test: unit (dev) (which has a test for this reader) Closes #9913	2022-01-17 10:31:37 +02:00
Botond Dénes	c727360eca	db: convert data listeners to v2 To remove yet another back-and-forth conversion in table::make_reader_v2(). Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220114085551.565752-1-bdenes@scylladb.com>	2022-01-14 13:57:44 +02:00
Botond Dénes	d6efe27545	Merge 'db: config: add a flag to disable new reversed reads algorithm' from Kamil Braun Just in case the new algorithm turns out to be buggy, or give a performance regression, add a flag to fall-back to the old algorithm for use in the field. Closes #9908 * github.com:scylladb/scylla: db: config: add a flag to disable new reversed reads algorithm replica: table: remove obsolete comment about reversed reads	2022-01-13 23:09:02 +02:00
Kamil Braun	e98711cfcb	db: config: add a flag to disable new reversed reads algorithm Just in case the new algorithm turns out to be buggy, or give a performance regression, add a flag to fall-back to the old algorithm for use in the field.	2022-01-12 18:59:19 +01:00
Gleb Natapov	9ce62bcc33	system_distributed_keyspace: move schema creation code to use raft	2022-01-12 16:40:06 +02:00
Gleb Natapov	459539e812	migration_manager: do not allow creating keyspace with arbitrary timestamp This was needed to fix issue #2129, which only manifested itself with auto_bootstrap set to false. The option is ignored now and we always wait for schema to sync during boot.	2022-01-12 16:33:15 +02:00
Nadav Har'El	7a9f69ec38	Merge 'lister cleanup and test' from Benny Halevy Split off of #9835. The series removes extraneous includes of lister.hh from header files and adds a unit test for lister::scan_dir to test throwing an exception from the walker function passed to `scan_dir`. Test: unit(dev) Closes #9885 * github.com:scylladb/scylla: test: add lister_list lister: add more overloads of fs::path operator/ for std::string and string_view resource_manager: remove unnecessary include of lister.hh from header file sstables: sstable_directory: remove unncessary include of lister.hh from header file	2022-01-12 08:20:07 +01:00
Benny Halevy	f4cd535e3d	resource_manager: remove unnecessary include of lister.hh from header file But define namespace fs = std::filesystem in the header since many use sites already depend on it and it's a convention throughout scylla's code. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-11 17:04:16 +02:00
Michael Livshin	91d38ef2a9	view_update_generator: remove unneeded call to downgrade_to_v1() Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-01-11 10:49:26 +02:00
Avi Kivity	bbad8f4677	replica: move ::database, ::keyspace, and ::table to replica namespace Move replica-oriented classes to the replica namespace. The main classes moved are ::database, ::keyspace, and ::table, but a few ancillary classes are also moved. There are certainly classes that should be moved but aren't (like distributed_loader) but we have to start somewhere. References are adjusted treewide. In many cases, it is obvious that a call site should not access the replica (but the data_dictionary instead), but that is left for separate work. scylla-gdb.py is adjusted to look for both the new and old names.	2022-01-07 12:04:38 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Avi Kivity	d01e1a774b	Merge 'Build performance: do not include the entire <seastar/net/ip.hh>' from Nadav Har'El The header file <seastar/net/ip.hh> is a large collection of unrelated stuff, and according to ClangBuildAnalyzer, takes 2 seconds to compile for every source file that included it - and unfortunately virtually all Scylla source files included it - through either "types.hh" or "gms/inet_address.hh". That's 2300 CPU seconds wasted. In this two-patch series we completely eliminate the inclusion of <seastar/net/ip.hh> from Scylla. We still need the ipv4_address, ipv6_address types (e.g., gms/inet_address.hh uses it to hold a node's IP address) so those were split (in a Seastar patch that is already in) from ip.hh into separate small header files that we can include. This patch reduces the entire build time (of build/dev/scylla) by 4% - reducing almost 10 CPU minutes (!) from the build. Closes #9875 github.com:scylladb/scylla: build performance: do not include <seastar/net/ip.hh> build performance: speed up inclusion of <gm/inet_address.hh>	2022-01-05 17:55:07 +02:00
Raphael S. Carvalho	426450dc04	treewide: remove useless include of database.hh Wrote a script based on cpp-include to find places that needlessly included database.hh, which is expensive to process during build time. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220104204359.168895-1-raphaelsc@scylladb.com>	2022-01-05 10:15:19 +02:00
Nadav Har'El	3fbbad7d60	build performance: speed up inclusion of <gm/inet_address.hh> The header file <gm/inet_address.hh> is included, directly or indirectly, from 291 source files in Scylla. It is hard to reduce this number because Scylla relies heavily on IP addresses as keys to different things. So it is important that this header file be fast to include. Unfortunately it wasn't... ClangBuildAnalyzer measurements showed that each inclusion of this header file added a whopping 2 seconds (in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU minutes - were spent just on this header file. It was actually worse because the build also spent additional time on template instantiation (more on this below). So in this patch we: 1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid including it in one place that doesn't need it. This is just cosmetic, and doesn't significantly speed up the build. 2. Move the to_sstring() implementation from the .hh to the .cc. This saves a lot of time on template instantiations - previously every source file instantiated this to_sstring(), which was slow (that "format" thing is slow). 3. Do not include <seastar/net/ip.hh> which is a huge file including half the world. All we need from it is the type "ipv4_address", so instead include just the new <seastar/net/ipv4_address.hh>. This change brings most of the performance improvement. Some source files forgot to include various Seastar header files because the includes-everything ip.hh did it - so we need to add these missing includes in this patch. After this patch, ClangBuildAnalyzer reports that the cost of inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326 seconds. Additionally the format<inet_address> template instantiation 291 times - about half a second each - is also gone. All in all, this patch should reduce around 10 CPU minutes from the build. Refs #1 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-01-04 21:07:23 +02:00
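The savings quoted above can be checked with a short calculation. The inputs are the figures from the commit message (291 including files, 2 seconds before, 0.326 seconds after, about half a second per template instantiation); the variable names are illustrative.

```python
FILES = 291                 # source files including the header
BEFORE_S = 2.0              # seconds per inclusion before the patch
AFTER_S = 0.326             # seconds per inclusion after the patch
TEMPLATE_S = 0.5            # approx. seconds per format<inet_address> instantiation

include_before = FILES * BEFORE_S          # CPU seconds spent before the patch
include_after = FILES * AFTER_S            # CPU seconds still spent afterwards
template_saved = FILES * TEMPLATE_S        # instantiation time no longer spent

saved_minutes = (include_before - include_after + template_saved) / 60
```

The inclusion cost before the patch comes to 582 CPU seconds, which matches the "600 CPU seconds" estimate, and the combined saving lands a little above 10 CPU minutes.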
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent onto the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after the data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode: if no tombstone_gc option is specified by the user, the old gc_grace_seconds-based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
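The four tombstone_gc modes listed above can be summarised as a decision function. This is a sketch under stated assumptions, not Scylla's implementation: `can_gc_tombstone` and its parameters are hypothetical, and the way the propagation delay is subtracted from the repair time in 'repair' mode is an assumed interpretation of the commit message.

```python
def can_gc_tombstone(mode, deletion_time, now, gc_grace_seconds,
                     last_repair_time=None, propagation_delay=0):
    """Return True if a tombstone may be garbage collected (times in seconds)."""
    if mode == "timeout":
        # Classic behaviour: wait for gc_grace_seconds to elapse.
        return deletion_time + gc_grace_seconds <= now
    if mode == "disabled":
        return False
    if mode == "immediate":
        return True
    if mode == "repair":
        # ASSUMPTION: purge only tombstones written early enough that the
        # last repair of the covering range must have seen them.
        if last_repair_time is None:
            return False
        return deletion_time <= last_repair_time - propagation_delay
    raise ValueError(f"unknown tombstone_gc mode: {mode!r}")

# A tombstone written at t=1000, checked at t=5000:
never_repaired = can_gc_tombstone("repair", 1000, 5000, 864000)
repaired_later = can_gc_tombstone("repair", 1000, 5000, 864000,
                                  last_repair_time=4800,
                                  propagation_delay=3600)
```

The point of 'repair' mode in this model is that elapsed time alone never makes a tombstone purgeable; only a repair that covered it does.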
Calle Wilund	3c02cab2f7	commitlog: Don't allow error_handler to swallow exception Fixes #9798 If an exception in allocate_segment_ex is (sub)type of std::system_error, commit_error_handler might _not_ cause throw (doh), in which case the error handling code would forget the current exception and return an unusable segment. Now only used as an exception pointer replacer. Closes #9870	2022-01-03 22:46:31 +02:00
Avi Kivity	9e74556413	Merge 'Support reverse reads in the row cache natively' from Tomasz Grabiec This change makes row cache support reverse reads natively so that reversing wrappers are not needed when reading from cache and thus the read can be executed efficiently, with similar cost as the forward-order read. The database is serving reverse reads from cache by default after this. Before, it was bypassing cache by default after `703aed3277`. Refs: #1413 Tests: - unit [dev] - manual query with build/dev/scylla and cache tracing on Closes #9454 * github.com:scylladb/scylla: tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries row_cache: partition_snapshot_row_cursor: Print more details about the current version vector row_cache: Improve trace-level logging config: Use cache for reversed reads by default config: Adjust reversed_reads_auto_bypass_cache description row_cache: Support reverse reads natively mvcc: partition_snapshot: Support slicing range tombstones in reverse test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition row_cache: Log produced range tombstones test: Make produces_range_tombstone() report ck_ranges tests: lib: random_mutation_generator: Extract make_random_range_tombstone() partition_snapshot_row_cursor: Support reverse iteration utils: immutable-collection: Make movable intrusive_btree: Make default-initialized iterator cast to false	2021-12-29 16:53:25 +02:00
Tomasz Grabiec	2a3450dfb7	Merge "db: save supported features after passing gossip feature check" from Pavel Solodovnikov Move saving features to `system.local#supported_features` to the point after passing all remote feature checks in the gossiper, right before joining the ring. This makes the `system.local#supported_features` column store the advertised feature set. Leave a comment in the definition of `system.local` schema to reflect that. Since the column value is not actually used anywhere for now, it shouldn't affect any tests or alter the existing behavior. Later, we can optimize the gossip communication between nodes in the cluster, removing the feature check altogether in some cases (since the column value should now be monotonic). * manmanson/save_adv_features_v2: db: save supported features after passing gossip feature check db: add `save_local_supported_features` function	2021-12-28 11:26:11 +02:00
Nadav Har'El	b8786b96f4	commitlog: fix missing wait for semaphore units Commit `dcc73c5d4e` introduced a semaphore for excluding concurrent recalculations - _reserve_recalculation_guard. Unfortunately, the two places in the code which tried to take this guard just called get_units() - which returns a future<units>, not units - and never waited for this future to become available. So this patch adds the missing "co_await" needed to wait for the units to become available. Fixes #9770. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211214122612.1462436-1-nyh@scylladb.com>	2021-12-27 16:56:30 +02:00
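The bug class fixed above, taking a guard by calling an asynchronous function but never waiting for it, has a close Python analogue. This is an asyncio illustration, not the commitlog code: `asyncio.Semaphore.acquire()` plays the role of `get_units()`, and `await` plays the role of `co_await`.

```python
import asyncio

async def demo():
    sem = asyncio.Semaphore(1)

    # Bug pattern: acquire() only creates a coroutine object; until it is
    # awaited, the semaphore has not been taken at all.
    pending = sem.acquire()
    held_without_await = sem.locked()
    pending.close()              # discard the never-awaited coroutine

    # Fix pattern: wait for the acquisition to complete.
    await sem.acquire()
    held_with_await = sem.locked()
    sem.release()

    return held_without_await, held_with_await
```

`asyncio.run(demo())` reports that the semaphore is not held in the first case and is held in the second, so code relying on the un-awaited call for mutual exclusion is not protected.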
Pavel Solodovnikov	83862d9871	db: save supported features after passing gossip feature check Move saving features to `system.local#supported_features` to the point after passing all remote feature checks in the gossiper, right before joining the ring. This makes `system.local#supported_features` column to store advertised feature set. Leave a comment in the definition of `system.local` schema to reflect that. Since the column value is not actually used anywhere for now, it shouldn't affect any tests or alter the existing behavior. Later, we can optimize the gossip communication between nodes in the cluster, removing the feature check altogether in some cases (since the column value should now be monotonic). Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-23 12:48:37 +03:00
Pavel Solodovnikov	96799a72d9	db: add `save_local_supported_features` function This is a utility function for writing the supported feature set to the `system.local` table. Will be used to move the corresponding part from `system_keyspace::setup_version` to the gossiper after passing remote feature check, effectively making `system.local#supported_features` store the advertised features (which already passed the feature check). Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-12-20 13:31:52 +03:00
Asias He	eba4a4fba4	repair: Allow ignoring dead nodes for replace operation Consider 1) n1, n2, n3, n4, n5 2) n2 and n3 are both down 3) start n6 to replace n2 4) start n7 to replace n3 We want to replace the dead nodes n2 and n3 to fix the cluster to have 5 running nodes. Replace operation in step 3 will fail because n3 is down. We would see errors like below: replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed for replace operation are down. It is highly recommended to fix the down nodes and try again. In the above example, currently, there is no way to replace any of the dead nodes. Users can either fix one of the dead nodes and run replace or run removenode operation to remove one of the dead nodes then run replace and run bootstrap to add another node. Fixing dead nodes is always the best solution but it might not be possible. Running removenode operation is not better than running replace operation (with best effort by ignoring the other dead node) in terms of data consistency. In addition, users have to run bootstrap operation to add back the removed node. So, allowing replacing in such case is a clear win. This patch adds the --ignore-dead-nodes-for-replace option to allow run replace operation with best effort mode. Please note, use this option only if the dead nodes are completely broken and down, and there is no way to fix the node and bring it back. This also means the user has to make sure the ignored dead nodes specified are really down to avoid any data consistency issue. Fixes #9757 Closes #9758	2021-12-20 00:49:03 +02:00
Tomasz Grabiec	65a1a0247a	config: Use cache for reversed reads by default	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	9fd1120ad5	config: Adjust reversed_reads_auto_bypass_cache description Bypassing cache is no longer necessary to use native reverse readers.	2021-12-19 22:41:35 +01:00
Avi Kivity	d768e9fac5	cql3, related: switch to data_dictionary Stop using database (and including database.hh) for schema related purposes and use data_dictionary instead. data_dictionary::database::real_database() is called from several places, for these reasons: - calling yet-to-be-converted code - callers with a legitimate need to access data (e.g. system_keyspace) but with the ::database accessor removed from query_processor. We'll need to find another way to supply system_keyspace with data access. - to gain access to the wasm engine for testing whether user-defined functions compile. We'll have to find another way to do this as well. The change is a straightforward replacement. One case in modification_statement had to change a capture, but everything else was just a search-and-replace. Some files that lost "database.hh" gained "mutation.hh", which they previously had access to through "database.hh".	2021-12-15 13:54:23 +02:00
Avi Kivity	3945acaa2d	data_dictionary: move keyspace_metadata to data_dictionary Like user_types_metadata, keyspace_metadata does not grant data access, just metadata, and so belongs in data_dictionary.	2021-12-15 13:52:21 +02:00
Avi Kivity	021c7593b8	data_dictionary: move user_types_metadata to new module data_dictionary The new module will contain all schema related metadata, detached from actual data access (provided by the database class). User types is the first contents to be moved to the new module.	2021-12-15 13:52:10 +02:00
Gleb Natapov	38e1f85959	migration_manager: drop view_ptr array from announce_column_family_update() No users pass it any longer.	2021-12-11 12:31:07 +02:00
Avi Kivity	f28552016f	Update seastar submodule * seastar f8a038a0a2...8d15e8e67a (21): > core/program_options: preserve defaultness of CLI arguments > log: Silence logger when logging > Include the core/loop.hh header inside when_all.hh header > http: Fix deprecated wrappers > foreign_ptr: Add concept > util: file: add read_entire_file > short_streams: move to util > Revert "Merge: file: util: add read_entire_file utilities" > foreign_ptr: declare destroy as a static method > Merge: file: util: add read_entire_file utilities > Merge "output_stream: handle close failure" from Benny > net: bring local_address() to seastar::connected_socket. > Merge "Allow programatically configuring seastar" from Botond > Merge 'core: clean up memory metric definitions' from John Spray > Add PopOS to debian list in install-dependencies.sh > Merge "make shared_mutex functions exception safe and noexcept" from Benny > on_internal_error: set_abort_on_internal_error: return current state > Implementation of iterator-range version of when_any > net: mark functions returning ethernet_address noexcept > net: ethernet_address: mark functions noexcept > shared_mutex: mark wake and unlock methods noexcept Contains patch from Botond Dénes <bdenes@scylladb.com>: db/config: configure logging based on app_template::seastar_options Scylla has its own config file which supports configuring aspects of logging, in addition to the built-in CLI logging options. When applying this configuration, the CLI provided option values have priority over the ones coming from the option file. To implement this scylla currently reads CLI options belonging to seastar from the boost program options variable map. The internal representation of CLI options however do not constitute an API of seastar and are thus subject to change (even if unlikely). This patch moves away from this practice and uses the new shiny C++ api: `app_template::seastar_options` to obtain the current logging options.	2021-12-08 14:21:11 +02:00
Botond Dénes	2e5440bdf2	Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too. There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2. tests: unit(debug). Closes #9725 * github.com:scylladb/scylla: compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2 sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2 sstable_set: update incremental_reader_selector to flat_mutation_reader_v2	2021-12-07 15:17:38 +02:00
Raphael S. Carvalho	aebbe68239	sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:53 -03:00

2413 Commits