In a future patch we need to parse a type in a context where no
`user_types_storage` is available, but a `user_types_metadata` is,
and the latter is enough to parse the type.
Add a variant of `type_parser::parse()` that takes a
`user_types_metadata` instead of a `user_types_storage`, so that
a type can also be parsed in the described context.
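A minimal sketch of the kind of overload this adds; the signatures below are assumptions for illustration, not the actual declarations:

```cpp
// Sketch only -- parameter and return types are placeholders.
class type_parser {
public:
    // existing variant: resolves user types through a user_types_storage
    static data_type parse(const sstring& name, const user_types_storage& uts);
    // new variant: the per-keyspace user_types_metadata alone is enough
    static data_type parse(const sstring& name, const user_types_metadata& utm);
};
```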
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes #12788
This patch is based on #12681; only the last 3 commits are relevant.
As described in #12709, currently, when a UDF used in a UDA is replaced, the UDA is not updated until the whole node is restarted.
This patch fixes the issue by updating all affected UDAs when a UDF is replaced.
Additionally, it includes a few convenience changes
Closes #12710
* github.com:scylladb/scylladb:
uda: change the UDF used in a UDA if it's replaced
functions: add helper same_signature method
uda: return aggregate functions as shared pointers
This reverts commit e7d5e508bc. It ends up
failing continuous integration tests randomly. We don't know if it's
uncovering an existing bug, or if RBNO itself is broken, but for now we
need to revert it to unblock progress.
request_controller_timeout_exception_factory::timeout() creates an
instance of `request_controller_timed_out_error`, whose ctor is
default-generated by the compiler from that of `timed_out_error`,
which is in turn default-generated from that of `std::exception`.
Since `std::exception::exception` does not throw, it is safe to
mark this factory method `noexcept`.
With this specifier we don't need to worry about exceptions thrown
by it, and `seastar::semaphore`, where `timeout()` is called to create
the customized exception, doesn't need to handle any.
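A hedged sketch of the shape being described (the real classes live elsewhere in the tree and may differ in detail):

```cpp
#include <seastar/core/timed_out_error.hh>

class request_controller_timed_out_error : public seastar::timed_out_error {
    // no user-declared ctor: the compiler-generated default ctor is noexcept,
    // since timed_out_error's and std::exception's default ctors are noexcept
};

struct request_controller_timeout_exception_factory {
    static request_controller_timed_out_error timeout() noexcept {
        return request_controller_timed_out_error();
    }
};
```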
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12759
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.
The system_keyspace defines several auxiliary methods that help view_builder update the system.scylla_views_builds_in_progress and system.built_views tables. All of them use the global qctx.
It only takes adding a view_builder -> system_keyspace dependency to de-static all those helpers and let them use the query processor from it rather than the qctx.
Closes #12728
* github.com:scylladb/scylladb:
system_keyspace: De-static calls that update view-building tables
storage_service: Coroutinize mark_existing_views_as_built()
api: Unset column_family endpoints
api: Carry sharded<db::system_keyspace> reference over
view_builder: Add system_keyspace dependency
Since 97bb2e47ff (storage_service: Enable Repair Based Node Operations (RBNO) by default for replace), RBNO was enabled by default for replace ops.
After more testing, we decided to enable repair based node operations by default for all node operations.
Closes #12173
* github.com:scylladb/scylladb:
storage_service: Enable Repair Based Node Operations (RBNO) by default for all node ops
test: Increase START_TIMEOUT
test: Increase max-networking-io-control-blocks
storage_service: Check node has left in node_ops_cmd::decommission_done
repair: Use remote dc neighbors for everywhere strategy
This series switches memtable and cache to use a new representation for mutation data,
called `mutation_partition_v2`. In this representation, range tombstone information is stored
in the same tree as rows, attached to row entries. Each entry has a new tombstone field,
which represents the part of the range tombstone that applies to the interval between this entry and
the previous one. See docs/dev/mvcc.md for more details about the format.
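A conceptual sketch of the idea (not the real class layout) may help:

```cpp
// Illustration only: in the v2 representation, range tombstone information is
// attached to the row entries themselves rather than kept in a separate tree.
struct rows_entry_v2 {
    clustering_key key;
    deletable_row  row;
    // Applies to the interval between the previous entry in the tree and this
    // one; this extra field is what makes rows_entry larger (see below).
    tombstone range_tombstone;
};
```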
The transient mutation object still uses the old model in order to avoid work needed to adapt
old code to the new model. It may also be a good idea to live with two models, since the
transient mutation has different requirements and thus different trade-offs can be made.
Transient mutation doesn't need to support eviction and strong exception guarantees,
so its algorithms and in-memory representation can be simpler.
This allows us to incrementally evict range tombstone information. Before this series,
range tombstones were accumulated and evicted only when the whole partition entry was evicted. This
could lead to inefficient use of cache memory.
Another advantage of the new representation is that reads don't have to look up
range tombstone information in a different tree while reading. This leads to simpler
and more efficient readers.
There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes.
Secondly, update algorithms are more complex because they need to deoverlap range tombstone
information. Also, to handle preemption and provide strong exception guarantees, update
algorithms may need to allocate sentinel entries, which adds complexity and reduces performance.
The memtable reader was changed to use the same cursor implementation
which the cache uses, for improved code reuse and a reduced risk of bugs
due to discrepancies between the algorithms that deal with MVCC.
Remaining work:
- performance optimizations to apply_monotonically() to avoid regressions
- performance testing
- preemption support in apply_to_incomplete (cache update from memtable)
Fixes #2578
Fixes #3288
Fixes #10587
Closes #12048
* github.com:scylladb/scylladb:
test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction
test: mvcc: Extract mvcc_container::allocate_in_region()
row_cache, lru: Introduce evict_shallow()
test: mvcc: Avoid copies of mutation under failure injection
test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic
mutation_partition_v2: Avoid full scan when applying mutation to non-evictable
Pass is_evictable to apply()
tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1
tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint
tests: mutation_partition_v2: Add test for exception safety of mutation merging
tests: Add tests for the mutation_partition_v2 model
mutation_partition_v2: Implement compact()
cache_tracker: Extract insert(mutation_partition_v2&)
mvcc, mutation_partition: Document guarantees in case merging succeeds
mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically()
mutation_partition_v2: Simplify get_continuity()
row_cache: Distinguish dummy insertion site in trace log
db: Use mutation_partition_v2 in mvcc
range_tombstone_change_merger: Introduce peek()
readers: Extract range_tombstone_change_merger
mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots
mvcc: partition_snapshot_row_cursor: Support digest calculation
mutation_partition_v2: Store range tombstones together with rows
db: Introduce mutation_partition_v2
doc: Introduce docs/dev/mvcc.md
db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU
db: Print range_tombstone bounds as position_in_partition
test: memtable_test: Relax test_segment_migration_during_flush
test: cache_flat_mutation_reader: Avoid timestamp clash
test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows
test: mvcc: Fix sporadic failures due to compact_for_compaction()
test: lib: random_mutation_generator: Produce partition tombstone less often
test: lib: random_utils: Introduce with_probability()
test: lib: Improve error message in has_same_continuity()
test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker
test: mvcc: Insert entries in the tracker
test: mvcc_test: Do not set dummy::no on non-clustering rows
mutation_partition: Print full position in error report in append_clustered_row()
db: mutation_cleaner: Extract make_region_space_guard()
position_in_partition: Optimize equality check
mvcc: Fix version merging state resetting
mutation_partition: apply_resume: Mark operator bool() as explicit
There's a bunch of them, used mainly by view_builder and also by the API
and storage_service. All use the global qctx to do their job; now that the
callers have main-local sharded<system_keyspace> references, they can be
made non-static.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view builder updates the system.scylla_views_builds_in_progress and
system.built_views tables and thus needs the system keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since 97bb2e47ff (storage_service: Enable
Repair Based Node Operations (RBNO) by default for replace), RBNO was
enabled by default for replace ops.
After more testing, we decided to enable repair based node operations by
default for all node operations.
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space is not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
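A hedged sketch of the corrected ordering (member names are taken from the description above, not from the actual code):

```cpp
// The size must be captured before remove_file() resets it, otherwise the
// subtraction is a no-op and total_size_on_disk only ever grows.
auto freed = f.known_size();   // capture the segment's on-disk size first
f.remove_file();               // this resets known_size() to 0
total_size_on_disk -= freed;   // the freed space is now accounted for
```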
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes #12645
Closes #12646
Since we're potentially searching for the row_lock in parallel with acquiring the read_lock on the partition, we're racing with row_locker::unlock, which may erase the _row_locks entry for the same clustering key: there is no lock protecting it until the partition lock has been acquired and the lock_partition future is resolved.
This change moves the code that searches for or allocates the row lock to _after_ the partition lock has been acquired, so that we start the read/write lock function on it synchronously, without yielding, preventing this use-after-free.
This adds an up-front allocation to copy the clustering key, which wasn't needed before when the lock was already found, but view building is not on the hot path, so we can tolerate that.
This is required on top of 5007ded2c1, as seen in https://github.com/scylladb/scylladb/issues/12632, which is closely related to #12168 but demonstrates a different race causing a use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12639
* github.com:scylladb/scylladb:
view: row_lock: lock_ck: try_emplace row_lock entry
view: row_lock: lock_ck: find or construct row_lock under partition lock
Will be used by MVCC tests which don't want to (or can't) deal with the
row_cache as the container but work with the partition_entry directly.
Currently, rows_entry::on_evicted() assumes that it's embedded in
row_cache and would segfault when trying to evict the containing
partition entry which is not embedded in row_cache. The solution is to
call evict_shallow() from mvcc_tests, which does not attempt to evict
the containing partition_entry.
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.
The memtable reader was changed to use the same cursor implementation
which the cache uses, for improved code reuse and a reduced risk of bugs
due to discrepancies between the algorithms that deal with MVCC.
Range tombstone eviction in cache now has fine granularity, like with
rows.
Fixes #2578
Fixes #3288
Fixes #10587
We currently don't clean up the system_distributed.view_build_status
table after nodes are removed. This can cause a false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, and even when done, it might not be immediate. So until then,
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
Fixes: #11905
Refs: #11836
Closes #11860
Use the same method as the two-level lock at the
partition level. try_emplace will either use
an existing entry, if found, or create a new
one otherwise.
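For illustration, a minimal, self-contained example of the try_emplace pattern (the real row_locker container and lock types differ):

```cpp
#include <map>
#include <string>

struct row_lock_stub {};          // stands in for the real per-row lock type

int main() {
    std::map<std::string, row_lock_stub> row_locks;
    std::string ck = "clustering key";
    // Constructs a new entry only if 'ck' is absent; otherwise the existing
    // entry is returned untouched. Either way 'it' is usable immediately,
    // which is what lock_ck needs once the partition lock is already held.
    auto [it, inserted] = row_locks.try_emplace(ck);
    return inserted ? 0 : 1;
}
```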
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since we're potentially searching for the row_lock in parallel with
acquiring the read_lock on the partition, we're racing with
row_locker::unlock, which may erase the _row_locks entry for the same
clustering key: there is no lock protecting it until the partition lock
has been acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an up-front allocation to copy the clustering key
even if a row_lock entry already exists, which wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we come to lock it. In the fast path there
is no contention, and there the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates do.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before this change, we returned the total memory managed by Seastar
in the "total" field of system.memory. But this value only reflects
the memory managed by Seastar's allocator: if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem reserves a chunk of memory of the
specified size for the system and takes only the remainder. Since
f05d612da8, we set this value to 50MB for the wasmtime runtime, hence
the `TestRuntimeInfoTable.test_default_content` test in dtest
fails: it expects the size passed via the `--memory` option
to be identical to the value reported by system.memory's
"total" field.
After this change, the "total" field takes the memory reserved
for wasm UDFs into account. The "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.
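Roughly, the reported value changes as sketched below (the accessor for the reserved amount is an assumption for illustration):

```cpp
#include <seastar/core/memory.hh>

// Hedged sketch: "total" now covers everything Scylla took from the system,
// i.e. the Seastar-managed memory plus the chunk reserved outside of it
// (e.g. the 50MB set aside for the wasmtime runtime).
uint64_t total = seastar::memory::stats().total_memory()  // Seastar allocator
               + reserved_additional_memory;              // assumed accessor
```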
Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12573
It's only called on cluster-join from storage_service, which has the
local system_keyspace reference, and the latter is already started by that time.
This allows removing a few more occurrences of the global qctx.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just unroll the fn().then({ fn2().then().then(); }); chain.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The reader concurrency semaphore has no mechanism to limit the memory consumption of already admitted reads. Once the collective memory consumption of all admitted reads is above the limit, all it can do is not admit any more. Sometimes this is not enough, and the memory consumption of the already admitted reads balloons to the point of OOMing the node. This pull request offers a solution: it introduces two more layers of defense, a soft and a hard limit. Both are multipliers applied to the semaphore's normal memory limit.
When the soft limit threshold is surpassed, all readers but one are blocked via a new blocking `request_memory()` call, which is used by the `tracking_file_impl`. The reader allowed to proceed is chosen at random: it is the first reader that happens to request memory after the limit is surpassed. This is both very simple and should avoid situations where the algorithm choosing the reader allowed to proceed picks a reader which will then always time out.
When the hard limit threshold is surpassed, `reader_concurrency_semaphore::consume()` starts throwing `std::bad_alloc`. This again will result in eliminating whichever reader was unlucky enough to request memory at the right moment.
With this, the semaphore is now effectively enforcing an upper bound for memory consumption, defined by the hard limit.
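A hedged sketch of how the two thresholds act (the structure and member names are assumed, not the actual reader_concurrency_semaphore code):

```cpp
#include <seastar/core/future.hh>
#include <new>

// kill_multiplier and serialize_multiplier correspond to the new
// reader_concurrency_semaphore_{kill,serialize}_limit_multiplier options.
seastar::future<> request_memory(size_t bytes) {
    if (_consumed_memory + bytes > _memory_limit * _kill_multiplier) {
        // hard limit: the unlucky reader asking for memory right now is killed
        return seastar::make_exception_future<>(std::bad_alloc());
    }
    if (_consumed_memory + bytes > _memory_limit * _serialize_multiplier) {
        // soft limit: all readers but one wait until memory is released
        return wait_for_memory(bytes);
    }
    consume(bytes); // below both thresholds: grant immediately
    return seastar::make_ready_future<>();
}
```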
Refs: https://github.com/scylladb/scylladb/issues/11927
Closes #11955
* github.com:scylladb/scylladb:
test: reader_concurrency_semaphore_test: add tests for semaphore memory limits
reader_permit: expose operator<<(reader_permit::state)
reader_permit: add id() accessor
reader_concurrency_semaphore: add foreach_permit()
reader_concurrency_semaphore: document the new memory limits
reader_concurrency_semaphore: add OOM killer
reader_concurrency_semaphore: make consume() and signal() private
test: stop using reader_concurrency_semaphore::{consume,signal}() directly
reader_concurrency_semaphore: move consume() out-of-line
reader_permit: consume(): make it exception-safe
reader_permit: resource_units::reset(): only call consume() if needed
reader_concurrency_semaphore: tracked_file_impl: use request_memory()
reader_concurrency_semaphore: add request_memory()
reader_concurrency_semaphore: wrap wait list
reader_concurrency_semaphore: add {serialize,kill}_limit_multiplier parameters
test/boost/reader_concurrency_semaphore_test: dummy_file_impl: don't use hardcoded buffer size
reader_permit: add make_new_tracked_temporary_buffer()
reader_permit: add get_state() accessor
reader_permit: resource_units: add constructor for already consumed res
reader_permit: resource_units: remove noexcept qualifier from constructor
db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier
scylla-gdb.py: scylla-memory: extract semaphore stats formatting code
scylla-gdb.py: fix spelling of "graphviz"
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
Remove the `ip_addr` column which was not used. IP addresses are not
part of Raft configuration now and they can change dynamically.
Swap the `server_id` and `disposition` columns in the clustering key, so
when querying the configuration, we first obtain all servers with the
current disposition and then all servers with the previous disposition
(note that a server may appear both in current and previous).
As requested by issue #5619, commit 2150c0f7a2
added a sanity check for USING TIMESTAMP - the number specified in the
timestamp must not be more than 3 days into the future (when viewed as
a number of microseconds since the epoch).
This sanity checking helps avoid some annoying client-side bugs and
mis-configurations, but some users genuinely want to use arbitrary
or futuristic-looking timestamps and are hindered by this sanity check
(which Cassandra doesn't have, by the way).
So in this patch we add a new configuration option, restrict_future_timestamp.
If set to "true", futuristic timestamps (more than 3 days into the future)
are forbidden. The "true" setting is the default (as has been the case
since #5619). Setting this option to "false" will allow using any 64-bit
integer as a timestamp, as is allowed in Cassandra (and was allowed in
Scylla prior to #5619).
The error message in the case where a futuristic timestamp is rejected
now mentions the configuration parameter that can be used to disable this
check (this, and the option's name "restrict_*", is similar to other
so-called "safe mode" options).
This patch also includes a test, which works in Scylla and Cassandra,
with either setting of restrict_future_timestamp, checking the right
thing in all these cases (the futuristic timestamp can either be written
and read, or can't be written). I used this test to manually verify that
the new option works, defaults to "true", and when set to "false" Scylla
behaves like Cassandra.
Fixes #12527
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12537
The replace_address options are still supported,
but their description now mentions that they are deprecated
and that the user should use replace_node_first_boot instead.
While at it, fix a typo in ignore_dead_nodes_for_replace.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For replacing a node given its (now unique) Host ID.
The existing options for replace_address*
will be deprecated in the following patches
and eventually we will stop supporting them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).
In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
Currently, UDAs can't be reused if Scylla has been
restarted since they were created. This is
caused by the missing initialization of saved
UDAs, which should have inserted them into the
cql3::functions::functions::_declared map that
stores all user-created functions and
aggregates.
This patch adds the missing implementation in a way
that's analogous to how UDFs are inserted into
the _declared map.
Fixes #11309
The CQL binary protocol version 3 was introduced in 2014. All Scylla
versions support it, as do Cassandra versions 2.1 and newer.
Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.
Since protocols 1 and 2 have been obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.
Fixes #10607
Closes #12432
* github.com:scylladb/scylladb:
treewide: drop cql_serialization_format
cql: modification_statement: drop protocol check for LWT
transport: drop cql protocol versions 1 and 2
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable code by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory the WASM programs use, and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually achieved by evicting all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
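A hedged sketch of the eviction loop described above (the names are assumed for illustration):

```cpp
// Evict least-recently-used instances until some module drops its last
// reference (freeing its executable code), or give up and fail the query.
bool try_make_room_for_module(size_t needed) {
    while (executable_memory_in_use() + needed > executable_memory_limit()) {
        if (!instance_cache_evict_lru()) {
            // every remaining instance is currently in use; the query fails
            return false;
        }
        // if the evicted instance was its module's last reference, that
        // module's executable code has just been deallocated
    }
    return true; // the new module can now be added to the tracking system
}
```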
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their wasm instance cache,
the maximum size of individual instances stored in the cache, the
time after which the instances are evicted, the fuel that all wasm
UDFs are allowed to consume before yielding (to control
latency), the fuel that wasm UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop), and the hard limit on the size of UDFs
that are executed (to avoid large allocations).
Now that we don't accept cql protocol versions 1 or 2, we can
drop cql_serialization_format everywhere, except in the IDL
(since it's part of the inter-node protocol).
A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.
Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization
format when receiving. The IDL is unchanged.
One test checking the 16-bit serialization format is removed.
Unlike other experimental features, we want Raft to be optional even
after it leaves experimental mode. For that we need a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
This reverts commit ac2e2f8883. It causes
a regression ("std::bad_variant_access in load_view_build_progress").
Commit 2978052113 (a reindent) is also reverted as part of
the process.
Fixes #12395
This new option allows the user to control the number of compaction groups
per table per shard. It's 0 by default, which implies a single compaction
group, as is the case today.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
active_memtable() was fine for a single group, but with multiple groups,
there will be one active memtable per group. Let's change the
interface to reflect that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now, with a44ca06906, is_normal_token_owner, which replaced is_member,
no longer relies on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12294
* github.com:scylladb/scylladb:
topology: get rid of pending state
topology: debug log update and remove endpoint
Refactor the existing stats tracking and updating
code into struct latency_stats_tracker, and while at it,
count lock_acquisitions only on success.
Decrement operations_currently_waiting_for_lock in the destructor
so it's always balanced with the unconditional increment
in the ctor.
As for updating estimated_waiting_for_lock, it is always
updated in the dtor, both on success and failure since
the wait for the lock happened, whether waiting
timed out or not.
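A hedged sketch of the RAII shape described above (the real struct and stats members differ; timing uses std::chrono here for simplicity):

```cpp
#include <chrono>

struct latency_stats_tracker {
    stats& _stats;   // placeholder for the real view-update stats struct
    std::chrono::steady_clock::time_point _start;

    explicit latency_stats_tracker(stats& s)
        : _stats(s), _start(std::chrono::steady_clock::now()) {
        ++_stats.operations_currently_waiting_for_lock;  // unconditional
    }
    ~latency_stats_tracker() {
        --_stats.operations_currently_waiting_for_lock;  // always balanced
        // the wait happened whether or not it timed out, so always record it
        _stats.estimated_waiting_for_lock.add(std::chrono::steady_clock::now() - _start);
    }
    void lock_acquired() {
        ++_stats.lock_acquisitions;                      // only on success
    }
};
```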
Fixes #12190
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12225
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batches of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafted large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: to detect when build_some() is done, it is no
longer enough to check whether it returned zero view-update rows; rather,
it now explicitly returns whether or not it is done, as a std::optional.
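A hedged sketch of the changed contract (the return type is simplified; not the actual signature):

```cpp
// A disengaged optional now means "the build is done", while an engaged but
// empty batch means "nothing to apply this round, keep going" -- the case the
// old empty-result check mistook for completion.
std::optional<std::vector<view_update>> build_some();

// caller loop (sketch):
while (auto updates = build_some()) {
    apply_updates(*updates);  // may be empty if every row in the batch was skipped
}
```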
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
Fixes #12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12305
Now, with a44ca06906,
is_normal_token_owner, which replaced is_member,
no longer relies on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep
all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Thanks to #12250, Host IDs uniquely identify nodes. We can use them as Raft IDs, which simplifies the code and makes reasoning about it easier, because Host IDs are always guaranteed to be present (while Raft IDs may be missing during upgrade).
Fixes: https://github.com/scylladb/scylladb/issues/12204
Closes #12275
* github.com:scylladb/scylladb:
service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0`
gms, service: stop gossiping and storing RAFT_SERVER_ID
Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round"
service: use HOST_ID instead of RAFT_SERVER_ID during replace
service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map
main: use Host ID as Raft ID