scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 11:10:40 +00:00

Author	SHA1	Message	Date
Patryk Jędrzejczak	4cd5847761	config: add schema_commitlog_segment_size_in_mb variable In #14668, we have decided to introduce a new scylla.yaml variable for the schema commitlog segment size. The segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. Therefore, increasing the schema commitlog segment size is sometimes necessary. (cherry picked from commit `5b167a4ad7`)	2023-08-02 18:05:39 +02:00
Nadav Har'El	e34c62c567	Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions. This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed. Fixes: #14819 Closes #14821 * github.com:scylladb/scylladb: test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations db/view/view_updating_consumer: account for the size of mutations mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() mutation/mutation: add memory_usage() (cherry picked from commit `056d04954c`)	2023-07-31 03:43:44 -04:00
Michał Chojnowski	75933b9906	view_updating_consumer: make buffer limit a variable The limit doesn't change at runtime, but this patch makes it a variable for unit testing purposes.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	fc7b02c8e4	view: fix range tombstone handling on flushes in view_updating_consumer View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening it at the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes #14503	2023-07-11 09:44:00 +02:00
Botond Dénes	486483b379	Merge '[Backport 5.2]: node ops backports' from Benny Halevy This branch backports to branch-5.2 several fixes related to node operations: - `ba919aa88a` (PR #12980; Fixes: #11011, #12969) - `53636167ca` (part of PR #12970; Fixes: #12764, #12956) - `5856e69462` (part of PR #12970) - `2b44631ded` (PR #13028; Fixes: #12989) - `6373452b31` (PR #12799; Fixes #12798) Closes #13531 * github.com:scylladb/scylladb: Merge 'Do not mask node operation errors' from Benny Halevy Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec storage_service: Wait for normal state handler to finish in replace storage_service: Wait for normal state handler to finish in bootstrap storage_service: Send heartbeat earlier for node ops	2023-05-17 16:46:49 +03:00
Raphael S. Carvalho	26b4d2c3c1	db/view/build_progress_virtual_reader: Fix use-after-move use-after-free in ctor, which potentially leads to a failure when locating table from moved schema object. static report In file included from db/system_keyspace.cc:51: ./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed] _db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS), Fixes #13395. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `1ecba373d6`)	2023-05-15 20:26:01 +03:00
Benny Halevy	5785550e24	view: view_builder: start: demote sleep_aborted log error This is not really an error, so print it in debug log_level rather than error log_level. Fixes #13374 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #13462 (cherry picked from commit `cc42f00232`)	2023-05-14 21:21:59 +03:00
Marcin Maliszkiewicz	a2fed1588e	db: view: use deferred_close for closing staging_sstable_reader When consume_in_thread throws the reader should still be closed. Related https://github.com/scylladb/scylla-enterprise/issues/2661 Closes #13398 Refs: scylladb/scylla-enterprise#2661 Fixes: #13413 (cherry picked from commit `99f8d7dcbe`)	2023-05-08 09:41:07 +03:00
Kamil Braun	42fd3704e4	Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec This patch fixes a problem which affects decommission and removenode which may lead to data consistency problems under conditions which lead one of the nodes to unilaterally decide to abort the node operation without the coordinator noticing. If this happens during streaming, the node operation coordinator would proceed to make a change in the gossiper, and only later detect that one of the nodes aborted during sending of decommission_done or removenode_done command. That's too late, because the operation will be finalized by all the nodes once gossip propagates. It's unsafe to finalize the operation while another node aborted. The other node reverted to the old topology, with which they were running for some time, without considering the pending replica when handling requests. As a result, we may end up with consistency issues. Writes made by those coordinators may not be replicated to CL replicas in the new topology. Streaming may have failed to replicate those writes depending on timing. It's possible that some node aborts but streaming succeeds if the abort is not due to network problems, or if the network problems are transient and/or localized and affect only heartbeats. There is no way to revert after we commit the node operation to the gossiper, so it's ok to close node_ops sessions before making the change to the gossiper, and thus detect aborts and prevent later aborts after the change in the gossiper is made. This is already done during bootstrap (RBNO enabled) and replacenode. This patch changes removenode to also take this approach by moving sending of remove_done earlier. We cannot take this approach with decommission easily, because decommission_done command includes a wait for the node to leave the ring, which won't happen before the change to the gossiper is made. 
Separating this from decommission_done would require protocol changes. This patch adds a second-best solution, which is to check if sessions are still there right before making a change to the gossiper, leaving decommission_done where it was. The race can still happen, but the time window is now much smaller. The PR also lays down infrastructure which enables testing the scenarios. It makes node ops watchdog periods configurable, and adds error injections. Fixes #12989 Refs #12969 Closes #13028 * github.com:scylladb/scylladb: storage_service: node ops: Extract node_ops_insert() to reduce code duplication storage_service: Make node operations safer by detecting asymmetric abort storage_service: node ops: Add error injections service: node_ops: Make watchdog and heartbeat intervals configurable (cherry picked from commit `2b44631ded`)	2023-04-30 18:58:28 +03:00
Botond Dénes	50095cc3a5	Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun in `make_group0_history_state_id_mutation`, when adding a new entry to the group 0 history table, if the parameter `gc_older_than` is engaged, we create a range tombstone in the mutation which deletes entries older than the new one by `gc_older_than`. In particular if `gc_older_than = 0`, we want to delete all older entries. There was a subtle bug there: we were using millisecond resolution when generating the tombstone, while the provided state IDs used microsecond resolution. On a super fast machine it could happen that we managed to perform two schema changes in a single millisecond; this happened sometimes in `group0_test.test_group0_history_clearing_old_entries` on our new CI/promotion machines, causing the test to fail because the tombstone didn't clear the entry corresponding to the previous schema change when performing the next schema change (since they happened in the same millisecond). Use microsecond resolution to fix that. The consecutive state IDs used in group 0 mutations are guaranteed to be strictly monotonic at microsecond resolution (see `generate_group0_state_id` in service/raft/raft_group0_client.cc). Fixes #13594 Closes #13604 * github.com:scylladb/scylladb: db: system_keyspace: use microsecond resolution for group0_history range tombstone utils: UUID_gen: accept decimicroseconds in min_time_UUID (cherry picked from commit `10c1f1dc80`)	2023-04-23 16:03:02 +03:00
Wojciech Mitros	5fd4bb853b	uda: return aggregate functions as shared pointers We will want to reuse the functions that we get from an aggregate without making a deep copy, and it's only possible if we get pointers from the aggregate instead of actual values. (cherry picked from commit `20069372e7`)	2023-04-17 13:14:24 +02:00
Botond Dénes	128050e984	Merge 'commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off' from Calle Wilund Fixes #12810 We did not update total_size_on_disk in commitlog totals when use o_dsync was off. This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments. Closes #12950 * github.com:scylladb/scylladb: commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off commitlog: change type of stored size (cherry picked from commit `e70be47276`)	2023-04-03 08:57:43 +03:00
Botond Dénes	e380c24c69	Merge 'Improve database shutdown verbosity' from Pavel Emelyanov The `database::stop` method sometimes hangs and it's always hard to spot where exactly it sleeps. A few more logging messages would make this much simpler. refs: #13100 refs: #10941 Closes #13141 * github.com:scylladb/scylladb: database: Increase verbosity of database::stop() method large_data_handler: Increase verbosity on shutdown large_data_handler: Coroutinize .stop() method (cherry picked from commit `e22b27a107`)	2023-03-30 17:01:24 +03:00
Botond Dénes	c013336121	db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts We currently don't clean up the system_distributed.view_build_status table after removed nodes. This can cause a false-positive check for whether view update generation is needed for streaming. The proper fix is to clean up this table, but that will be more involved, and even when done, it might not be immediate. So until then and to be on the safe side, filter out entries belonging to unknown hosts from said table. Fixes: #11905 Refs: #11836 Closes #11860 (cherry picked from commit `84a69b6adb`)	2023-03-22 09:03:50 +02:00
Botond Dénes	bd4f9e3615	Merge 'readers/nonforwarding: don't emit partition_end on next_partition,fast_forward_to' from Gusev Petr The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()` Fixes: #12249 Closes #12978 * github.com:scylladb/scylladb: flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE make_nonforwardable: test through run_mutation_source_tests make_nonforwardable: next_partition and fast_forward_to when single_partition is true make_forwardable: fix next_partition flat_mutation_reader_v2: drop forward_buffer_to nonforwardable reader: fix indentation nonforwardable reader: refactor, extract reset_partition nonforwardable reader: add more tests nonforwardable reader: no partition_end after fast_forward_to() nonforwardable reader: no partition_end after next_partition() nonforwardable reader: no partition_end for empty reader row_cache: pass partition_start though nonforwardable reader (cherry picked from commit `46efdfa1a1`)	2023-03-16 10:42:03 +02:00
Kefu Chai	b2699743cc	db: system_keyspace: take the reserved_memory into account before this change, we return the total memory managed by Seastar in the "total" field in system.memory. but this value only reflects the total memory managed by Seastar's allocator. if `reserve_additional_memory` is set when starting app_template, Seastar's memory subsystem just reserves a chunk of memory of this specified size for system, and takes the remaining memory. since `f05d612da8`, we set this value to 50MB for wasmtime runtime. hence the test of `TestRuntimeInfoTable.test_default_content` in dtest fails. the test expects the size passed via the option of `--memory` to be identical to the value reported by system.memory's "total" field. after this change, the "total" field takes the reserved memory for wasm udf into account. the "total" field should reflect the total size of memory used by Scylla, no matter how we use a certain portion of the allocated memory. Fixes #12522 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12573 (cherry picked from commit `4a0134a097`)	2023-02-05 18:30:05 +02:00
Benny Halevy	0f9fe61d91	view: row_lock: lock_ck: find or construct row_lock under partition lock Since we're potentially searching the row_lock in parallel to acquiring the read_lock on the partition, we're racing with row_locker::unlock that may erase the _row_locks entry for the same clustering key, since there is no lock to protect it up until the partition lock has been acquired and the lock_partition future is resolved. This change moves the code to search for or allocate the row lock _after_ the partition lock has been acquired to make sure we're synchronously starting the read/write lock function on it, without yielding, to prevent this use-after-free. This adds an allocation for copying the clustering key in advance even if a row_lock entry already exists, that wasn't needed before. It only slows us down (a bit) when there is contention and the lock already existed when we want to go locking. In the fast path there is no contention and then the code already had to create the lock and copy the key. In any case, the penalty of copying the key once is tiny compared to the rest of the work that view updates are doing. This is required on top of `5007ded2c1` as seen in https://github.com/scylladb/scylladb/issues/12632 which is closely related to #12168 but demonstrates a different race causing use-after-free. Fixes #12632 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `4b5e324ecb`)	2023-02-05 17:22:31 +02:00
Michał Chojnowski	608ef92a71	commitlog: fix total_size_on_disk accounting after segment file removal Currently, segment file removal first calls `f.remove_file()` and does `total_size_on_disk -= f.known_size()` later. However, `remove_file()` resets `known_size` to 0, so in effect the freed space is not accounted for. `total_size_on_disk` is not just a metric. It is also responsible for deciding whether a segment should be recycled -- it is recycled only if `total_size_on_disk - known_size < max_disk_size`. Therefore this bug has dire performance consequences: if `total_size_on_disk - known_size` ever exceeds `max_disk_size`, the recycling of commitlog segments will stop permanently, because `total_size_on_disk - known_size` will never go back below `max_disk_size` due to the accounting bug. All new segments from this point will be allocated from scratch. The bug was uncovered by a QA performance test. It isn't easy to trigger -- it took the test 7 hours of constant high load to step into it. However, the fact that the effect is permanent, and degrades the performance of the cluster silently, makes the bug potentially quite severe. The bug can be easily spotted with Prometheus as infinitely rising `commitlog_total_size_on_disk` on the affected shards. Fixes #12645 Closes #12646 (cherry picked from commit `fa7e904cd6`)	2023-02-01 21:54:37 +02:00
Kamil Braun	a483915c62	db: system_keyspace: add a virtual table with raft configuration Add a new virtual table `system.raft_state` that shows the currently operating Raft configuration for each present group. The schema is the same as `system.raft_snapshot_config` (the latter shows the config from the last snapshot). In the future we plan to add more columns to this table, showing more information (like the current leader and term), hence the generic name. Adding the table requires some plumbing of `sharded<raft_group_registry>&` through function parameters to make it accessible from `register_virtual_tables`, but it's mostly straightforward. Also added some APIs to `raft_group_registry` to list all groups and find a given group (returning `nullptr` if one isn't found, not throwing an exception).	2023-01-17 12:28:00 +01:00
Kamil Braun	2bfe85ce9b	db: system_keyspace: improve system.raft_snapshot_config schema Remove the `ip_addr` column which was not used. IP addresses are not part of Raft configuration now and they can change dynamically. Swap the `server_id` and `disposition` columns in the clustering key, so when querying the configuration, we first obtain all servers with the current disposition and then all servers with the previous disposition (note that a server may appear both in current and previous).	2023-01-17 12:28:00 +01:00
Nadav Har'El	5bf94ae220	cql: allow disabling of USING TIMESTAMP sanity checking As requested by issue #5619, commit `2150c0f7a2` added a sanity check for USING TIMESTAMP - the number specified in the timestamp must not be more than 3 days into the future (when viewed as a number of microseconds since the epoch). This sanity checking helps avoid some annoying client-side bugs and mis-configurations, but some users genuinely want to use arbitrary or futuristic-looking timestamps and are hindered by this sanity check (which Cassandra doesn't have, by the way). So in this patch we add a new configuration option, restrict_future_timestamp. If set to "true", futuristic timestamps (more than 3 days into the future) are forbidden. The "true" setting is the default (as has been the case since #5619). Setting this option to "false" will allow using any 64-bit integer as a timestamp, as is allowed in Cassandra (and was allowed in Scylla prior to #5619). The error message in the case where a futuristic timestamp is rejected now mentions the configuration parameter that can be used to disable this check (this, and the option's name "restrict_*", is similar to other so-called "safe mode" options). This patch also includes a test, which works in Scylla and Cassandra, with either setting of restrict_future_timestamp, checking the right thing in all these cases (the futuristic timestamp can either be written and read, or can't be written). I used this test to manually verify that the new option works, defaults to "true", and when set to "false" Scylla behaves like Cassandra. Fixes #12527 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12537	2023-01-16 23:18:56 +02:00
Benny Halevy	1577aa8098	db: config: describe replace_address* options as deprecated The replace_address options are still supported, but mention in their description that they are now deprecated and the user should use replace_node_first_boot instead. While at it fix a typo in ignore_dead_nodes_for_replace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:09 +02:00
Benny Halevy	32e79185d4	db: config: add replace_node_first_boot option For replacing a node given its (now unique) Host ID. The existing options for replace_address* will be deprecated in the following patches and eventually we will stop supporting them. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Kamil Braun	be390285b6	db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG A single node will run a single Raft server in any given Raft group, so this column is not necessary.	2023-01-12 16:48:50 +01:00
Kamil Braun	bed555d1e5	db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config' Make it clear that the table stores the snapshot configuration, which is not necessarily the currently operating configuration (the last one appended to the log). In the future we plan to have a separate virtual table for showing the currently operating configuration, perhaps we will call it `system.raft_config`.	2023-01-12 16:21:26 +01:00
Wojciech Mitros	e558c7d988	functions: initialize aggregates on scylla start Currently, UDAs can't be reused if Scylla has been restarted since they have been created. This is caused by the missing initialization of saved UDAs that should have inserted them to the cql3::functions::functions::_declared map, that should store all (user-)created functions and aggregates. This patch adds the missing implementation in a way that's analogous to the method of inserting UDF to the _declared map. Fixes #11309	2023-01-10 17:44:18 +02:00
Nadav Har'El	d6e6820f33	Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity The CQL binary protocol version 3 was introduced in 2014. All Scylla versions support it, and Cassandra versions 2.1 and newer. Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer use 32-bit collection sizes. Unfortunately, we implemented support for multiple serialization formats very intrusively, by pushing the format everywhere. This avoids the need to re-serialize (sometimes) but is quite obnoxious. It's also likely to be broken, since it's almost untested and it's too easy to write cql_serialization_format::internal() instead of propagating the client specified value. Since protocols 1 and 2 have been obsolete for 9 years, just drop them. It's easy to verify that they are no longer in use on a running system by examining the `system.clients` table before upgrade. Fixes #10607 Closes #12432 * github.com:scylladb/scylladb: treewide: drop cql_serialization_format cql: modification_statement: drop protocol check for LWT transport: drop cql protocol versions 1 and 2	2023-01-09 18:52:41 +02:00
Wojciech Mitros	f05d612da8	wasm: limit memory allocated using mmap The wasmtime runtime allocates memory for the executable code of the WASM programs using mmap and not the seastar allocator. As a result, the memory that Scylla actually uses becomes not only the memory preallocated for the seastar allocator but the sum of that and the memory allocated for executable codes by the WASM runtime. To keep limiting the memory used by Scylla, we measure how much memory the WASM programs use and if they use too much, compiled WASM UDFs (modules) that are currently not in use are evicted to make room. To evict a module it is required to evict all instances of this module (the underlying implementation of modules and instances uses shared pointers to the executable code). For this reason, we add reference counts to modules. Each instance using a module is a reference. When an instance is destroyed, a reference is removed. If all references to a module are removed, the executable code for this module is deallocated. The eviction of a module is actually achieved by eviction of all its references. When we want to free memory for a new module we repeatedly evict instances from the wasm_instance_cache using its LRU strategy until some module loses all its instances. This process may not succeed if the instances currently in use (so not in the cache) use too much memory - in this case the query also fails. Otherwise the new module is added to the tracking system. This strategy may evict some instances unnecessarily, but evicting modules should not happen frequently, and any more efficient solution requires an even bigger intervention into the code.	2023-01-06 14:07:29 +01:00
Wojciech Mitros	b8d28a95bf	wasm: add configuration options for instance cache and udf execution Different users may require different limits for their UDFs. This patch allows them to configure the size of their cache of wasm, the maximum size of individual instances stored in the cache, the time after which the instances are evicted, the fuel that all wasm UDFs are allowed to consume before yielding (for the control of latency), the fuel that wasm UDFs are allowed to consume in total (to allow performing longer computations in the UDF without detecting an infinite loop) and the hard limit of the size of UDFs that are executed (to avoid large allocations)	2023-01-06 14:07:27 +01:00
Avi Kivity	2739ac66ed	treewide: drop cql_serialization_format Now that we don't accept cql protocol version 1 or 2, we can drop cql_serialization_format everywhere, except in the IDL (since it's part of the inter-node protocol). A few functions had duplicate versions, one with and one without a cql_serialization_format parameter. They are deduplicated. Care is taken that `partition_slice`, which communicates the cql_serialization_format across nodes, still presents a valid cql_serialization_format to other nodes when transmitting itself and rejects protocol 1 and 2 serialization format when receiving. The IDL is unchanged. One test checking the 16-bit serialization format is removed.	2023-01-03 19:54:13 +02:00
Gleb Natapov	1688163233	raft: replace experimental raft option with dedicated flag Unlike other experimental features, we want raft to be optional even after it leaves experimental mode. For that we need to have a separate option to enable it. The patch adds the binary option "consistent-cluster-management" for that.	2023-01-03 11:15:11 +02:00
Gleb Natapov	84eb5924ac	system_keyspace: remove redundant include storage_proxy.hh is included twice Message-Id: <20221228144944.3299711-4-gleb@scylladb.com>	2023-01-02 11:39:22 +02:00
Avi Kivity	eced91b575	Revert "view: coroutinize maybe_mark_view_as_built" This reverts commit `ac2e2f8883`. It causes a regression ("std::bad_variant_access in load_view_build_progress"). Commit `2978052113` (a reindent) is also reverted as part of the process. Fixes #12395	2022-12-28 15:36:05 +02:00
Raphael S. Carvalho	d9ab59043e	db: Add config for setting static number of compaction groups This new option allows the user to control the number of compaction groups per table per shard. It's 0 by default which implies a single compaction group, as is the case today. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:16:24 -03:00
Raphael S. Carvalho	ef8f542d75	replica: Adapt table::active_memtable() to compaction groups active_memtable() was fine for a single group, but with multiple groups, there will be one active memtable per group. Let's change the interface to reflect that. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:14 -03:00
Michał Chojnowski	b52bd9ef6a	db: commitlog: remove unused max_active_writes() Dead and misleading code. Closes #12327	2022-12-16 10:23:03 +02:00
Pavel Emelyanov	d561495f0d	Merge 'topology: get rid of pending state' from Benny Halevy Now, with `a44ca06906`, is_normal_token_owner that replaced is_member does not rely anymore on the pending status of endpoints in topology. With that we can get rid of this state and just keep all endpoints we know about in the topology. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12294 * github.com:scylladb/scylladb: topology: get rid of pending state topology: debug log update and remove endpoint	2022-12-14 19:28:35 +03:00
Benny Halevy	bdb6550305	view: row_locker: add latency_stats_tracker Refactor the existing stats tracking and updating code into struct latency_stats_tracker and while at it, count lock_acquisitions only on success. Decrement operations_currently_waiting_for_lock in the destructor so it's always balanced with the unconditional increment in the ctor. As for updating estimated_waiting_for_lock, it is always updated in the dtor, both on success and failure since the wait for the lock happened, whether waiting timed out or not. Fixes #12190 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12225	2022-12-14 17:37:22 +02:00
Nadav Har'El	92d03be37b	materialized view: fix bug in some large modifications to base partitions Sometimes a single modification to a base partition requires updates to a large number of view rows. A common example is deletion of a base partition containing many rows. A large BATCH is also possible. To avoid large allocations, we split the large amount of work into batches of 100 (max_rows_for_view_updates) rows each. The existing code assumed an empty result from one of these batches meant that we are done. But this assumption was incorrect: There are several cases when a base-table update may not need a view update to be generated (see can_skip_view_updates()) so if all 100 rows in a batch were skipped, the view update stopped prematurely. This patch includes two tests showing when this bug can happen - one test using a partition deletion with a USING TIMESTAMP causing the deletion to not affect the first 100 rows, and a second test using a specially-crafted large BATCH. These use cases are fairly esoteric, but in fact hit a user in the wild, which led to the discovery of this bug. The fix is fairly simple: To detect when build_some() is done it is no longer enough to check if it returned zero view-update rows; Rather, it explicitly returns whether or not it is done as an std::optional. The patch includes several tests for this bug, which pass on Cassandra, failed on Scylla before this patch, and pass with this patch. Fixes #12297. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12305	2022-12-14 14:50:38 +02:00
Benny Halevy	68141d0aac	topology: get rid of pending state Now, with `a44ca06906`, is_normal_token_owner that replaced is_member does not rely anymore on the pending status of endpoints in topology. With that we can get rid of this state and just keep all endpoints we know about in the topology. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-13 14:17:18 +02:00
Avi Kivity	75e469193b	Merge 'Use Host ID as Raft ID' from Kamil Braun Thanks to #12250, Host IDs uniquely identify nodes. We can use them as Raft IDs which simplifies the code and makes reasoning about it easier, because Host IDs are always guaranteed to be present (while Raft IDs may be missing during upgrade). Fixes: https://github.com/scylladb/scylladb/issues/12204 Closes #12275 * github.com:scylladb/scylladb: service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0` gms, service: stop gossiping and storing RAFT_SERVER_ID Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round" service: use HOST_ID instead of RAFT_SERVER_ID during replace service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map main: use Host ID as Raft ID	2022-12-13 13:39:41 +02:00
Kamil Braun	bf6679906f	gms, service: stop gossiping and storing RAFT_SERVER_ID It is equal to (if present) HOST_ID and no longer used for anything. The application state was only gossiped if `experimental-features` contained `raft`, so we can free this slot. Similarly, `raft_server_id`s were only persisted in `system.peers` if the `SUPPORTS_RAFT` cluster feature was enabled, which happened only when `experimental-features` contained `raft`. The `raft_server_id` field in the schema was also introduced recently in `master` and didn't get to be in a release yet. Given either of these reasons, we can remove this field safely.	2022-12-12 15:20:30 +01:00
Calle Wilund	e99626dc10	config: Change wording of "none" in encryption options to maybe reduce user confusion Fixes /scylladb/scylla-enterprise/issues#1262 Changes the somewhat ambiguous "none" into "not set" to clarify that "none" is not an option to be written out, but an absence of a choice (in which case you also have made a choice). Closes #12270	2022-12-12 16:14:53 +02:00
Kamil Braun	f3243ff674	main: use Host ID as Raft ID The Host ID now uniquely identifies a node (we no longer steal it during node replace) and Raft is still experimental. We can reuse the Host ID of a node as its Raft ID. This will allow us to remove and simplify a lot of code. With this we can already remove some dead code in this commit.	2022-12-12 15:14:51 +01:00
Benny Halevy	89920d47d6	db: system_keyspace: change set_local_host_id to private set_local_random_host_id Now that the local host_id is never changed externally (by the storage_service upon replace-node), the method can be made private and be used only for initializing the local host_id to a random one. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-09 08:23:31 +02:00
Nadav Har'El	4cdaba778d	Merge 'Secondary indexes on static columns' from Piotr Dulikowski This pull request introduces support for global secondary indexes based on static columns. Local secondary indexes based on static columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require the full partition key when querying, such indexes wouldn't be very useful and would only waste resources. The index table for secondary indexes on static columns, unlike other secondary indexes, does not contain clustering keys from the base table. A static column's value determines a set of full partitions, so the clustering keys would be unnecessary. The already existing logic for querying using secondary indexes works after introducing minimal modifications. The view update generation path now works on a common representation of static and clustering rows, but the new representation allowed keeping most of the logic intact. New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed. Materialized views with static rows as a key are __not__ implemented in this PR. 
Fixes: #2963 Closes #11166 * github.com:scylladb/scylladb: test_materialized_view: verify that static columns are not allowed test_secondary_index: add (currently failing) test for static index paging test_secondary_index: add more tests for secondary indexes on static columns cassandra_tests: enable existing tests for static columns create_index_statement: lift restriction on secondary indexes on static rows db/view: fetch and process static rows when building indexes gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature create_index_statement: disallow creation of local indexes with static columns select_statement: prepare paging for indexes on static columns select_statement: do not attempt to fetch clustering columns from secondary index's table secondary_index_manager: don't add clustering key columns to index table of static column index replica/table: adjust the view read-before-write to return static rows when needed db/view: process static rows in view_update_builder::on_results db/view: adjust existing view update generation path to use clustering_or_static_row column_computation: adjust to use clustering_or_static_row db/view: add clustering_or_static_row deletable_row: add column_kind parameter to is_live view_info: adjust view_column to accept column_kind db/view: base_dependent_view_info: split non-pk columns into regular and static	2022-12-08 09:54:05 +02:00
Benny Halevy	a076ceef97	view: row_lock: lock_ck: reindent Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-12-07 19:27:30 +02:00
Benny Halevy	5007ded2c1	view: row_lock: lock_ck: serialize partition and row locking The problematic scenario this patch fixes might happen due to unfortunate serialization of locks/unlocks between lock_pk and lock_ck, as follows: 1. lock_pk acquires an exclusive lock on the partition. 2.a lock_ck attempts to acquire shared lock on the partition and any lock on the row. both cases currently use a fiber returning a future<rwlock::holder>. 2.b since the partition is locked, the lock_partition times out returning an exceptional future. lock_row has no such problem and succeeds, returning a future holding a rwlock::holder, pointing to the row lock. 3.a the lock_holder previously returned by lock_pk is destroyed, calling `row_locker::unlock` 3.b row_locker::unlock sees that the partition is not locked and erases it, including the row locks it contains. 4.a when_all_succeeds continuation in lock_ck runs. Since the lock_partition future failed, it destroys both futures. 4.b the lock_row future is destroyed with the rwlock::holder value. 4.c ~holder attempts to return the semaphore units to the row rwlock, but the latter was already destroyed in 3.b above. Acquiring the partition lock and row lock in parallel doesn't help anything, but it complicates error handling as seen above. This patch serializes acquiring the row lock in lock_ck after locking the partition to prevent the above race. This way, erasing the unlocked partition is never expected to happen while any of its row locks is held. Fixes #12168 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12208	2022-12-06 16:29:46 +02:00
Piotr Dulikowski	86dad30b66	db/view: fetch and process static rows when building indexes This commit modifies the view builder and its consumer so that static rows are always fetched and properly processed during view build. Currently, the view builder will always fetch both static and clustering rows, regardless of the type of indexes being built. For indexes on static columns this is wasteful and could be improved so that only the types of rows relevant to indexes being built are fetched - however, doing this sounds a bit complicated and I would rather start with something simpler which has a better chance of working.	2022-12-06 11:21:16 +01:00
Piotr Dulikowski	6ab41d76e6	replica/table: adjust the view read-before-write to return static rows when needed Adjusts the read-before-write query issued in `table::do_push_view_replica_updates` so that, when needed, it requests static columns and makes sure that the static row is present.	2022-12-06 11:21:16 +01:00
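
The batching bug described in commit 92d03be37b can be illustrated with a small model. This is only a sketch in Python, not Scylla's C++ code: the `Builder` class, `skip_below`, `run_buggy` and `run_fixed` are hypothetical names, and `None` stands in for the empty `std::optional` that the commit message says `build_some()` now returns when it is done.

```python
from typing import List, Optional

MAX_ROWS_FOR_VIEW_UPDATES = 100  # batch size named in the commit message


class Builder:
    """Hypothetical stand-in for the view update builder."""

    def __init__(self, rows: List[int], skip_below: int) -> None:
        self.rows = rows
        self.pos = 0
        # Rows below this value need no view update; this is a stand-in
        # for the cases covered by can_skip_view_updates().
        self.skip_below = skip_below
        self.updates: List[int] = []

    def build_some(self) -> Optional[int]:
        """Process one batch. Return None when the input is exhausted,
        otherwise the number of view updates generated (possibly 0)."""
        if self.pos >= len(self.rows):
            return None
        batch = self.rows[self.pos:self.pos + MAX_ROWS_FOR_VIEW_UPDATES]
        self.pos += len(batch)
        generated = [r for r in batch if r >= self.skip_below]
        self.updates.extend(generated)
        return len(generated)


def run_buggy(builder: Builder) -> int:
    # Old assumption: an empty batch means the work is finished.
    while True:
        n = builder.build_some()
        if not n:  # stops on None and also on 0
            break
    return len(builder.updates)


def run_fixed(builder: Builder) -> int:
    # New behaviour: stop only when build_some() says it is done.
    while builder.build_some() is not None:
        pass
    return len(builder.updates)
```

With 250 base rows of which the first 100 are skipped, the first batch generates nothing, so the old loop stops with no view updates at all while the fixed loop goes on to process the remaining 150 rows.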

2865 Commits