scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 21:47:10 +00:00

Author	SHA1	Message	Date
Kefu Chai	a439ebcfce	treewide: include fmt/ranges.h and/or fmt/std.h before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting the container types, like vector, map optional and variant using {fmt} instead of the homebrew formatter based on operator<<. with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop `FMT_DEPRECATED_OSTREAM` macro. Refs scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-04-19 22:56:16 +08:00
Kamil Braun	9c2a836607	Revert "Merge 'Drain view_builder in generic drain' from ScyllaDB" This reverts commit `298a7fcbf2`, reversing changes made to `5cf53e670d`. The change made CI flaky. Fixes: scylladb/scylladb#18278	2024-04-18 11:50:41 +02:00
Pavel Emelyanov	90593f4e82	view_builder: Generalize mark_as_built(view_ptr) method Marking is performed in two places and they can be generalized Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-05 19:56:12 +03:00
Pavel Emelyanov	3c3f2cd337	view_builder: Move mark_existing_views_as_built from storage service Now it's in the correct component Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-05 19:56:11 +03:00
Wojciech Mitros	9789a3dc7c	mv: keep semaphore units alive until the end of a remote view update When a view update has both a local and remote target endpoint, it extends the lifetime of its memory tracking semaphore units only until the end of the local update, while the resources are actually used until the remote update finishes. This patch changes the semaphore transferring so that in case of both local and remote endpoints, both view updates share the units, causing them to be released only after the update that takes longer finishes. Fixes #17890 Closes scylladb/scylladb#17891	2024-03-25 19:43:58 +02:00
Avi Kivity	72bbe75d5b	Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes https://github.com/scylladb/scylladb/issues/17025 Nodes in the left state will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those node's location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacanter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. Currently left nodes are never removed from topology, so will accumulate in memory. We could garbage-collect them from topology coordinator if a left node is absent in any replica set. That means we need a new state - left_for_real. Closes scylladb/scylladb#17388 * github.com:scylladb/scylladb: test: py: Add test for view replica pairing after replace raft, api: Add RESTful API to query current leader of a raft group test: test_tablets_removenode: Verify replacing when there is no spare node doc: topology-on-raft: Document replace behavior with tablets tablets, raft topology: Rebuild tablets after replacing node is normal tablets: load_balancer: Access node attributes via node struct tablets: load_balancer: Extract ensure_node() mv: Switch to using host_id-based replica set effective_replication_map: Introduce host_id-based get_replicas() raft topology: Keep nodes in the left state to topology tablets: Introduce read_required_hosts()	2024-03-18 16:16:08 +02:00
Wojciech Mitros	efcb718e0a	mv: adjust memory tracking of single view updates within a batch Currently, when dividing memory tracked for a batch of updates we do not take into account the overhead that we have for processing every update. This patch adds the overhead for single updates and joins the memory calculation path for batches and their parts so that both use the same overhead. Fixes #17854 Closes scylladb/scylladb#17855	2024-03-18 14:31:54 +02:00
Tomasz Grabiec	9b656ec2aa	mv: Switch to using host_id-based replica set This is necessary to not break replica pairing between base and view. After replacing a node, tablet replica set contains for a while the replaced node which is in the left state. This node is not returned by the IP-based get_natural_endpoints() so the replica indexes would shift, changing the pairing with the view. The host_id-based replica set always has stable indexes for replicas.	2024-03-15 11:05:29 +01:00
Pavel Emelyanov	3a734facc7	view_builder: Complete build step early if reader produces nothing Builder works in "steps". Each step runs for a given base table, when a new view is created it either initiates a step or appends to currently running step. Running a step means reading mutations from local sstables reader and applying them to all views that has jumped into this step so far. When a view is added to the step it remembers the current token value the step is on. When step receives end-of-stream it rewinds to minimal-token. Rewinding is done by closing current reader and creating a new one. Each time token is advanced, all the views that meet the new token value for the second time (i.e. -- scan full round) are marked as built and are removed from step. When no views are left on step, it finishes. The above machinery can break when rewinding the end-of-stream reader. The trick is that a running step silently assumes that if the reader once produced some token (and there can be a view that remembered this token as its starting one), then after rewinding the reader would generate the same token or greater. With tablets, however, that's not the case. When a node is decommissioned tablets are cleaned and all sstables are removed. Rewinding a reader after it makes empty reader that produces no tokens from now on. Respectively, any build steps that had captured tokens prior to cleanup would get stuck forever. The fix is to check if the mutation consumer stepped at least one step forward after rewind, and if no -- complete all the attached views. fixes: #17293 Similar thing should happen if the base table is truncated with views being built from it. Testing it steps on compaction assertion elsewhere and needs more research. refs: #17543 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17548	2024-03-12 14:58:47 +02:00
Nadav Har'El	19bcea6216	materialized views: fix rare failure caused by empty update This one-line patch fixes a failure in the dtest lwt_schema_modification_test.py::TestLWTSchemaModification ::test_table_alter_delete Where an update sometimes failed due to an internal server error, and the log had the mysterious warning message: "std::logic_error (Empty materialized view updated)" We've also seen this log-message in the past in another user's log, and never understood what it meant. It turns out that the error message was generated (and warning printed) while building view updates for a base-table mutation, and noticing that the base mutation contains an empty row - a row with no cells or tombstone or anything whatsoever. This case was deemed (8 years ago, in `d5a61a8c48`) unexpected and nonsensical, and we threw an exception. But this case actually can happen - here is how it happened in test_table_alter_delete - which is a test involving a strange combination of materialized views, LWT and schema changes: 1. A table has a materialized view, and also a regular column "int_col". 2. A background thread repeatedly drops and re-creates this column int_col. 3. Another thread deletes rows with LWT ("IF EXISTS"). 4. These LWT operations each reads the existing row, and because of repeated drop-and-recreate of the "int_col" column, sometimes this read notices that one node has a value for int_col and the other doesn't, and creates a read-repair mutation setting int_col (the difference between the two reads includes just in this column). 5. The node missing "int_col" receives this mutation which sets only int_col. It upgrade()s this mutation to its most recent schema, which doesn't have int_col, so it removes this column from the mutation row - and is left with a completely empty mutation row. This completely empty row is not useful, but upgrade() doesn't remove it. 6. The view-update generation code sees this empty base-mutation row and fails it with this std::logic_error. 7. The node which sent the read-repair mutation sees that the read repair failed, so it fails the read and therefore fails the LWT delete operation. It is this LWT operation which failed in the test, and caused the whole test to fail. The fix is trivial: an empty base-table row mutation should simply be ignored when generating view updates - it shouldn't cause any error. Before this patch, test_table_alter_delete used to fail in roughly 20% of the runs on my laptop. After this patch, I ran it 100 times without a single failure. Fixes #15228 Fixes #17549 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17607	2024-03-07 12:00:43 +02:00
Avi Kivity	51df8b9173	interval: rename nonwrapping_interval to interval Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.	2024-02-21 19:43:17 +02:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Botond Dénes	7f17d3bb0e	replica/database: keyspace: add uses_tablets() Mirroring table::uses_tablets(), provides a convenient and -- more importabtly -- easily discoverable way to determine whether the keyspace uses tablets or not. This information is of course already available via the abstract replication strategy, but as seen in a few examples, this is not easily discoverable and sometimes people resorted to enumerating the keyspace's tables to be able to invoke table::uses_tablets().	2024-02-15 01:51:26 -05:00
Nadav Har'El	21e7deafeb	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172	2024-02-12 13:17:29 +02:00
Nadav Har'El	14315fcbc3	mv: fix missing view deletions in some cases of range tombstones For efficiency, if a base-table update generates many view updates that go the same partition, they are collected as one mutation. If this mutation grows too big it can lead to memory exhaustion, so since commit `7d214800d0` we split the output mutation to mutations no longer than 100 rows (max_rows_for_view_updates) each. This patch fixes a bug where this split was done incorrectly when the update involved range tombstones, a bug which was discovered by a user in a real use case (#17117). Range tombstones are read in two parts, a beginning and an end, and the code could split the processing between these two parts and the result that some of the range tombstones in update could be missed - and the view could miss some deletions that happened in the base table. This patch fixes the code in two places to avoid breaking up the processing between range tombstones: 1. The counter "_op_count" that decides where to break the output mutation should only be incremented when adding rows to this output mutation. The existing code strangely incrmented it on every read (!?) which resulted in the counter being incremented on every input fragment, and in particular could reach the limit 100 between two range tombstone pieces. 2. Moreover, the length of output was checked in the wrong place... The existing code could get to 100 rows, not check at that point, read the next input - half a range tombstone - and only then check that we reached 100 rows and stop. The fix is to calculate the number of rows in the right place - exactly when it's needed, not before the step. The first change needs more justification: The old code, that incremented _op_count on every input fragment and not just output fragments did not fit the stated goal of its introduction - to avoid large allocations. In one test it resulted in breaking up the output mutation to chunks of 25 rows instead of the intended 100 rows. But, maybe there was another goal, to stop the iteration after 100 input rows and avoid the possibility of stalls if there are no output rows? It turns out the answer is no - we don't need this _op_count increment to avoid stalls: The function build_some() uses `co_await on_results()` to run one step of processing one input fragment - and `co_await` always checks for preemption. I verfied that indeed no stalls happen by using the existing test test_long_skipped_view_update_delete_with_timestamp. It generates a very long base update where all the view updates go to the same partition, but all but the last few updates don't generate any view updates. I confirmed that the fixed code loops over all these input rows without increasing _op_count and without generating any view update yet, but it does NOT stall. This patch also includes two tests reproducing this bug and confirming its fixed, and also two additional tests for breaking up long deletions that I wanted to make sure doesn't fail after this patch (it doesn't). By the way, this fix would have also fixed issue #12297 - which we fixed a year ago in a different way. That issue happend when the code went through 100 input rows without generating any output rows, and incorrectly concluding that there's no view update to send. With this fix, the code no longer stops generating the view update just because it saw 100 input rows - it would have waited until it generated 100 output rows in the view update (or the input is really done). Fixes #17117 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17164	2024-02-06 14:57:33 +02:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Nadav Har'El	1bcaeb89c7	view: revert cleanup filter that doesn't work with tablets This patch reverts commit `10f8f13b90` from November 2022. That commit added to the "view update generator", the code which builds view updates for staging sstables, a filter that ignores ranges that do not belong to this node. However, 1. I believe this filter was never necessary, because the view update code already silently ignores base updates which do not belong to this replica (see get_view_natural_endpoint()). After all, the view update needs to know that this replica is the Nth owner of the base update to send its update to the Nth view replica, but if no such N exists, no view update is sent. 2. The code introduced for that filter used a per-keyspace replication map, which was ok for vnodes but no longer works for tablets, and causes the operation using it to fail. 3. The filter was used every time the "view update generator" was used, regardless of whether any cleanup is necessary or not, so every such operation would fail with tablets. So for example the dtest test_mvs_populating_from_existing_data fails with tablets: * This test has view building in parallel with automatic tablet movement. * Tablet movement is streaming. * When streaming happens before view building has finished, the streamed sstables get "view update generator" run on them. This causes the problematic code to be called. Before this patch, the dtest test_mvs_populating_from_existing_data fails when tablets are enabled. After this patch, it passes. Fixes #16598 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-01-14 13:24:44 +02:00
Nadav Har'El	0fe40f729e	mv: sleep a bit before view-update-generator restart The "view update generator" is responsible for generating view updates for staging sstables (such as coming from repair). If the processing fails, the code retries - immediately. If there is some persistent bug, such as issue #16598, we will have a tight loop of error messages, potentially a gigabyte of identical messages every second. In this patch we simply add a sleep of one second after view update generation fails before retrying. We can still get many identical error messages if there is some bug, but not more than one per second. Refs #16598. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-01-14 13:13:52 +02:00
Kefu Chai	be364d30fd	db: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16664	2024-01-09 11:44:19 +02:00
Avi Kivity	6394854f04	Merge 'Some cleanups in tests for tablets + MV ' from Nadav Har'El This small series improves two things in the multi-node tests for tablet supports in materialized views: 1. The test for Alternator LSI, which "sometimes" could reproduce the bug by creating 10-node cluster with a random tablet distribution, is replaced by a reliable 2-node cluster which controls the tablet distribution. The new test also confirms that tablets are actually enabled in Alternator (reviewers of the original test noted it would be easy to pass the test if tablets were accidentally not enabled... :-)). 2. Simplify the tablet lookup code in the test to not go through a "table id", and lookup the table's (or view's) name directly (requires a full-table of the tablets table, but that's entirely reasonable in a test). The third patch in this series also fixes a comment typo discovered in a previous review. Closes scylladb/scylladb#16440 * github.com:scylladb/scylladb: materialized views: fix typo in comment test_mv_tablets: simplify lookup of tablets alternator, tablets: improve Alternator LSI tablets test	2023-12-27 20:18:14 +02:00
Benny Halevy	060b16f987	view: apply_to_remote_endpoints: fix use-after-free `b815aa021c` added a yield before the trace point, causing the moved `frozen_mutation_and_schema` (and `inet_address_vector_topology_change`) to drop out of scope and be destroyed, as the rvalue-referenced objects aren't moved onto the coroutine frame. This change passes them by value rather than by rvalue-reference so they will be stored in the coroutine frame. Fixes #16540 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#16541	2023-12-24 21:43:48 +02:00
Nadav Har'El	6640278aa7	materialized views: fix typo in comment Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-24 10:12:44 +02:00
Nadav Har'El	b815aa021c	mv, test: fix delay_before_remote_view_update injection point The "delay_before_remote_view_update" is a recently-added injection point which should add a delay before remove view updates, but NOT force the writer to wait for it (whether the writer waits for it or not depends on whether the view is configured as synchronous or not). Unfortunately, the delay was added at the WRONG place, which caused it to sometimes be done even on asynchronous views, breaking (with false-negative) the tests that need this delay to reproduce bugs of missing synchronous updates (Refs #16371). The fix here is even simpler then the (wrong) old code - we just add the sleep to the existing function apply_to_remote_endpoints() instead of making the caller even more complex. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-21 11:44:50 +02:00
Nadav Har'El	37b5c03865	mv: coroutinize wait code for remote view updates In the previous patch we added a delay injection point (for testing) in the view update code. Because the code was using continuation style, this resulted in increased indentation and ugly repetition of captures. So in this patch we coroutinize the code that waits for remote view updates, making it simpler, shorter, and less indented. Note that this function still uses continuations in one place: The remote view update is still composed of two steps that need to happen one after another, but we don't necessarily need to wait for them to happen. This is easiest to do with chaining continuations, and then either waiting or not waiting for the resulting future. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-17 20:15:08 +02:00
Nadav Har'El	bf6848d277	mv, test: add injection point to delay remove view update It's difficult to write a test (as we plan to do in to in the next patch) that verifies that synchronous view updates are indeed synchronous, i.e., that write with CL=QUORUM on the base-table write returns only after CL=QUORUM was also achieved in the view table. The difficulty is that in a fast test machine, even if the synchronous-view-update is completely buggy, it's likely that by the time the test reads from the view, all view updates will have been completed anyway. So in this patch we introduce an injection point, for testing, named "delay_before_remote_view_update", which adds a delay before the base replica sends its update to the remote view replica (in case the view replica is indeed remote). As usual, this injection point isn't configurable - when enabled it adds a fixed (0.5 second) delay, on all view updates on all tables. The existing code used continuation-style Seastar programming, and the addition of the injection point in this patch made it even uglier, so in the next patch we will coroutine-ize this code. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-17 20:15:08 +02:00
Petr Gusev	7b55ccbd8e	token_metadata: drop the template Replace token_metadata2 ->token_metadata, make token_metadata back non-template. No behavior changes, just compilation fixes.	2023-12-12 23:19:54 +04:00
Petr Gusev	e50dbef3e2	database: get_token_metadata -> new token_metadata database::get_token_metadata() is switched to token_metadata2. get_all_ips method is added to the host_id-based token_metadata, since its convenient and will be used in several places. It returns all current nodes converted to inet_address by means of the topology contained within token_metadata. hint_sender::can_send: if the node has already left the cluster we may not find its host_id. This case is handled in the same way as if it's not a normal token owner - we simply send a hint to all replicas.	2023-12-12 23:19:53 +04:00
Petr Gusev	5a1418fdba	token_metadata: get_endpoint_for_host_id -> get_endpoint_for_host_id_if_known This commit fixes an inconsistency in method names: get_host_id and get_host_id_if_known are (internal_error, returns null), but there was only one method for the opposite conversion - get_endpoint_for_host_id, and it returns null. In this commit we change it to on_internal_error if it can't find the argument and add another method get_endpoint_for_host_id_if_known which returns null in this case. We can't use get_endpoint_for_host_id/get_host_id in host_id_or_endpoint::resolve since it's called from storage_service::parse_node_list -> token_metadata::parse_host_id_and_endpoint, and exceptions are caught and handled in `storage_service::parse_node_list`.	2023-12-11 12:51:34 +04:00
Petr Gusev	63f64f3303	token_metadata: make it a template with NodeId=inet_address/host_id NodeId is used in all internal token_metadata data structures, that previously used inet_address. We choose topology::key_kind based on the value of the template parameter. generic_token_metadata::update_topology overload with host_id parameter is added to make update_topology_change_info work, it now uses NodeId as a parameter type. topology::remove_endpoint(host_id) is added to make generic_token_metadata::remove_endpoint(NodeId) work. pending_endpoints_for and endpoints_for_reading are just removed - they are not used and not implemented. The declarations were left by mistake from a refactoring in which these methods were moved to erm. generic_token_metadata_base is extracted to contain declarations, common to both token_metadata versions. Templates are explicitly instantiated inside token_metadata.cc, since implementation part is also a template and it's not exposed to the header. There are no other behavioral changes in this commit, just syntax fixes to make token_metadata a template.	2023-12-11 12:51:34 +04:00
Nadav Har'El	300e549267	tablets, mv: disable self-pairing when tablets are used A write to a base table can generate one or more writes to a materialized view. The write to RF base replicas need to cause writes to RF view replicas. Our MV implementation, based on Cassandra's implementation, does this via "pairing": Each one of the base replicas involved in this write sends each view update to exactly one view replica. The function get_view_natural_endpoint() tells a base replica which of the view replicas it should send the update to. The standard pairing is based on the ring order: The first owner of the base token sends to the first owner of the view token, the second to the second, and so on. However, the existing code also uses an optimization we call self-pairing: If a single node is both a base replica and a base replica, the pairing is modified so this node sends the update to itself. This patch disables the self-pairing optimization in keyspaces that use tablets: The self-pairing optimization can cause the pairing to change after token ranges are moved between nodes, so it can break base-view consistency in some edge cases, leading to "ghost rows". With tablets, these range movements become even more frequent - they can happen even if the cluster doesn't grow. This is why we want to solve this problem for tablets. For backward compatibility and to avoid sudden inconsistencies emerging during upgrades, we decided to continue using the self-pairing optimization for keyspaces that are not using tablets (i.e., using vnoodes). Currently, we don't introduce a "CREATE MATERIALIZED VIEW" option to override these defaults - i.e., we don't provide a way to disable self-pairing with vnodes or to enable them with tablets. We could introduce such a schema flag later, if we ever want to (and I'm not sure we want to). It's important to note, that in some cases, this change has implications on when view updates become synchronous, in the tablets case. For example: * If we have 3 nodes and RF=3, with the self-pairing optimization each node is paired with itself, the view update is local, and is implicitly synchronous (without requiring a "synchronous_updates" flag). * In the same setup with tablets, without the self-pairing optimization (due to this patch), this is not guaranteed. Some view updates may not be synchronous, i.e., the base write will not wait for the view write. If the user really wants synchronous updates, they should be requested explicitly, with the "synchronous_updates" view option. Fixes #16260. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16272	2023-12-06 17:11:17 +02:00
Benny Halevy	63b556123b	db/view: use locator::topology rather than fb_utilities Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 08:55:46 +02:00
Nadav Har'El	4505a86f46	tablets, mv: fix base-view pairing to consider base replication map In the view update code, the function get_view_natural_endpoint() determines which view replica this base replica should send an update to. It currently gets the view table's replication map (i.e., the map from view tokens to lists of replicas holding the token), but assumes that this is also the base table's replication map. This assumption was true with vnodes, but is no longer true with tablets - the base table's replication map can be completely different from the view table's. By looking at the wrong mapping, get_view_natural_endpoint() can believe that this node isn't really a base-replica and drop the view update. Alternatively, it can think it is a base replica - but use the wrong base-view pairing and create base-view inconsistencies. This patch solves this bug - get_view_natural_endpoint() now gets two separate replication maps - the base's and the view's. The callers need to remember what the base table was (in some cases they didn't care at the point of the call), and pass it to the function call. This patch also includes a simple test that reproduces the bug, and confirms it is fixed: The test has a 6-node cluster using tablets and a base table with RF=1, and writes one row to it. Before this patch, the code usually gets confused, thinking the base replica isn't a replica and loses the view update. With this patch, the view update works. Fixes #16227. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16228	2023-12-04 16:38:54 +02:00
Yaniv Kaul	2b73793a39	Update db/view/view.cc	2023-12-03 10:07:45 +02:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Nadav Har'El	62f89d49e5	tablets, mv: fix on_internal_error on write to base table This situation before this patch is that when tablets are enabled for a keyspace, we can create a materialized view but later any write to the base table fails with an on_internal_error(), saying that: "Tried to obtain per-keyspace effective replication map of test but it's per-table." Indeed, with tablets, the replication is different for each table - it's not the same for the entire keyspace. So this patch changes the view update code to take the replication map from the specific base table, not the keyspace. This is good enough to get materialized-views reads and writes working in a simple single-node case, as the included test demonstrates (the test fails with on_internal_error() before this patch, and passes afterwards). But this fix is not perfect - the base-view pairing code really needs to consider not only the base table's replication map, but also the view table's replication map - as those can be different. We'll fix this remaining problem as a followup in a separate patch - it will require a substantially more elaborate test to reproduce the need for the different mapping and to verify that fix. Fixes #16209. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16211	2023-11-29 15:29:17 +01:00
Pavel Emelyanov	3471f30b58	view_update_generator: Unplug from database later Patch `967ebacaa4` (view_update_generator: Move abort kicking to do_abort()) moved unplugging v.u.g from database from .stop() to .do_abort(). The latter call happens very early on stop -- once scylla receives SIGINT. However, database may still need v.u.g. plugged to flush views. This patch moves unplug to later, namely to .stop() method of v.u.g. which happens after database is drained and should no longer continue view updates. fixes: #16001 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#16091	2023-11-20 11:47:55 +02:00
Benny Halevy	a1acf6854b	everywhere: reduce dependencies on i_partitioner.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:47:44 +02:00
Avi Kivity	ef7db6df99	Merge 'schema_tables: turn view schema fixing code into a sanity check' from Kamil Braun The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal with legacy materialized view schemas used for secondary indexes, schemas which were created before the notion of "computed columns" was introduced. Back then, secondary index schemas would use a regular "token" column. Later it became a computed column and old schemas would be migrated during rolling upgrade. The migration code was introduced in 2019 (`db8d4a0cc6`) and then fixed in 2020 (`d473bc9b06`). The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming that users don't try crazy things like upgrading from 2021.X to 2023.X (which we do not support), all clusters will have already executed the migration code once they upgrade to 2023.X, meaning we can get rid of it. The main motivation of this PR is to get rid of the `db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft mode this was the only call to `merge_schema` outside "group 0 code" and in fact it is unsafe -- it uses locally generated mutations with locally generated timestamp (`api::new_timestamp()`), so if we actually did it, we would permanently diverge the group 0 state machine across nodes (the schema pulling code is disabled in Raft mode). Fortunately, this should be dead code by now, as explained in the previous paragraph. The migration code is now turned into a sanity check, if the users try something crazy, they will get an error instead of silent data corruption. Closes scylladb/scylladb#15695 * github.com:scylladb/scylladb: view: remove unused `_backing_secondary_index` schema_tables: turn view schema fixing code into a sanity check schema_tables: make comment more precise feature_service: make COMPUTED_COLUMNS feature unconditionally true	2023-10-31 13:23:19 +02:00
Marcin Maliszkiewicz	020a9c931b	db: view: run local materialized view mutations on a separate smp service group When base write triggers mv write and it needs to be send to another shard it used the same service group and we could end up with a deadlock. This fix affects also alternator's secondary indexes. Testing was done using (yet) not committed framework for easy alternator performance testing: https://github.com/scylladb/scylladb/pull/13121. I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and then ran: ./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \ --developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \ --duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000 Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb: p seastar::get_smp_service_groups_semaphore(2,0)._count $1 = 0 With the patch I wasn't able to observe the problem, even with 2x concurrency. I was able to make the process hang with 10x concurrency but I think it's hitting different limit as there wasn't any depleted smp service group semaphore and it was happening also on non mv loads. Fixes https://github.com/scylladb/scylladb/issues/15844 Closes scylladb/scylladb#15845	2023-10-29 18:30:32 +02:00
Kamil Braun	db49ccccb0	view: remove unused `_backing_secondary_index` This boolean was only used for a sanity check which was replaced with a stronger sanity check in the previous commit that doesn't require the boolean.	2023-10-24 13:33:36 +02:00
Kamil Braun	3976808b12	schema_tables: turn view schema fixing code into a sanity check The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal with legacy materialized view schemas used for secondary indexes, schemas which were created before the notion of "computed columns" was introduced. Back then, secondary index schemas would use a regular "token" column. Later it became a computed column and old schemas would be migrated during rolling upgrade. The migration code was introduced in 2019 (`db8d4a0cc6`) and then fixed in 2020 (`d473bc9b06`). The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming that users don't try crazy things like upgrading from 2021.X to 2023.X (which we do not support), all clusters will have already executed the migration code once they upgrade to 2023.X, meaning we can get rid of it. The main motivation of this patch is to get rid of the `db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft mode this was the only call to `merge_schema` outside "group 0 code" and in fact it is unsafe -- it uses locally generated mutations with locally generated timestamp (`api::new_timestamp()`), so if we actually did it, we would permanently diverge the group 0 state machine across nodes (the schema pulling code is disabled in Raft mode). Fortunately, this should be dead code by now, as explained in the previous paragraph. The migration code is now turned into a sanity check, if the users try something crazy, they will get an error instead of silent data corruption.	2023-10-24 13:33:35 +02:00
Jan Ciolek	940e44f887	db/view: change log level of failed view updates to WARN When a remote view update doesn't succeed there's a log message saying "Error applying view update...". This message had log level ERROR, but it's not really a hard error. View updates can fail for a multitude of reasons, even during normal operation. A failing view update isn't fatal, it will be saved as a view hint a retried later. Let's change the log level to WARN. It's something that shouldn't happen too much, but it's not a disaster either. ERROR log level causes trouble in tests which assume that an ERROR level message means that the test has failed. Refs: https://github.com/scylladb/scylladb/issues/15046#issuecomment-1712748784 For local view updates the log level stays at "ERROR", local view updates shouldn't fail. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> Closes scylladb/scylladb#15640	2023-10-11 18:19:23 +03:00
Pavel Emelyanov	becd960ae8	view_update_generator: Add logging to do_abort() Just tell the logs that the guy is aborting refs: #10941 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-21 13:34:21 +03:00
Pavel Emelyanov	967ebacaa4	view_update_generator: Move abort kicking to do_abort() When v.u.g. stops is first aborts the generation background fiber by requesting abort on the internal abort source and signalling the fiber in case it's waiting. Right now v.u.g.::stop() is defer-scheduled last in main(), so this move doesn't change much -- when stop_signal fires, it will kick the v.u.g.::do_abort() just a bit earlier, there's nothing that would happen after it before real ::stop() is called that depends on it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-21 13:32:45 +03:00
Pavel Emelyanov	e34220ebb7	view_update_generator: Add early abort subscription Subscribe v.u.g. to the main's stop_signal. For now a no-op callback. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-21 13:32:45 +03:00
Benny Halevy	2c54d7a35a	view, storage_proxy: carry effective_replication_map along with endpoints When sending mutation to remote endpoint, the selected endpoints must be in sync with the current effective_replication_map. Currently, the endpoints are sent down the storage_proxy stack, and later on an effective_replication_map is retrieved again, and it might not match the target or pending endpoints, similar to the case seen in https://github.com/scylladb/scylladb/issues/15138 The correct way is to carry the same effective replication map used to select said endpoints and pass it down the stack. See also https://github.com/scylladb/scylladb/pull/15141 Fixes scylladb/scylladb#15144 Fixes scylladb/scylladb#14730 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #15142	2023-08-29 09:08:42 +03:00
Benny Halevy	8fbcf1ab9f	view: start: ignore also abort_requested_exception We see the abort_requested_exception error from time to time, instead of sleep_aborted that was expected and quietly ignored (in debug log level). Treat abort_requested_exception the same way since the error is expected on shutdown and to reduce test flakiness, as seen for example, in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3033/artifact/logs-full.release.010/1691896356104_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_schema/node2.log ``` INFO 2023-08-13 03:12:29,151 [shard 0] compaction_manager - Asked to stop WARN 2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted) INFO 2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Finished main loop WARN 2023-08-13 03:12:29,152 [shard 0] cdc - Aborted update CDC description table with generation (2023/08/13 03:12:17, d74aad4b-6d30-4f22-947b-282a6e7c9892) INFO 2023-08-13 03:12:29,152 [shard 1] compaction_manager - Asked to stop INFO 2023-08-13 03:12:29,152 [shard 1] compaction_manager - Stopped INFO 2023-08-13 03:12:29,153 [shard 0] init - Signal received; shutting down INFO 2023-08-13 03:12:29,153 [shard 0] init - Shutting down view builder ops INFO 2023-08-13 03:12:29,153 [shard 0] view - Draining view builder INFO 2023-08-13 03:12:29,153 [shard 1] view - Draining view builder INFO 2023-08-13 03:12:29,153 [shard 0] compaction_manager - Stopped ERROR 2023-08-13 03:12:29,153 [shard 0] view - start failed: seastar::abort_requested_exception (abort requested) ERROR 2023-08-13 03:12:29,153 [shard 1] view - start failed: seastar::abort_requested_exception (abort requested) ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #15029	2023-08-13 18:39:09 +03:00
Botond Dénes	4a02865ea1	Merge 'Prevent invalidation of iterators over database::_column_families' from Aleksandra Martyniuk Maps related to column families in database are extracted to a column_families_data class. Access to them is possible only through methods. All methods which may preempt hold rwlock in relevant mode, so that the iterators can't become invalid. Fixes: #13290 Closes #13349 * github.com:scylladb/scylladb: replica: make tables_metadata's attributes private replica: add methods to get a filtered copy of tables map replica: add methods to check if given table exists replica: add methods to get table or table id replica: api: return table_id instead of const table_id& replica: iterate safely over tables related maps replica: pass tables_metadata to phased_barrier_top_10_counts replica: add methods to safely add and remove table replica: wrap column families related maps into tables_metadata replica: futurize database::add_column_family and database::remove	2023-07-31 15:31:59 +03:00
Botond Dénes	d66b07823b	db/view/view_updating_consumer: account for the size of mutations All partitions will have a corresponding mutation object in the buffer. These objects have non-negligible sizes, yet the consumer did not bump the _buffer_size when a new partition was consumer. This resulted in empty partitions not moving the _buffer_size at all, and thus they could accumulate without bounds in the buffer, never triggering a flush just by themselves. We have recently seen this causing OOM. This patch fixes that by bumping the _buffer_size with the size of the freshly created mutation object.	2023-07-26 03:07:25 -04:00
Aleksandra Martyniuk	cdbfa0b2f5	replica: iterate safely over tables related maps Loops over _column_families and _ks_cf_to_uuid which may preempt are protected by reader mode of rwlock so that iterators won't get invalid.	2023-07-25 17:13:04 +02:00

1 2 3 4 5 ...

515 Commits