scylladb

Author	SHA1	Message	Date
Piotr Dulikowski	a1436f1ce2	db/view: drop view updates to replaced node marked as left When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address. This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0. As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas. Fixes: scylladb/scylladb#19439 (cherry picked from commit `6af7882c59`)	2024-07-26 14:02:51 +00:00
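The workaround described in the commit above can be illustrated with a minimal sketch. It assumes a plain 4-byte array as the address type; `ipv4_bytes`, `is_all_zeros` and `drop_left_node_targets` are hypothetical names for illustration and are not ScyllaDB's `inet_address` or view update API.

```c++
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for an IPv4 address (not ScyllaDB's inet_address).
using ipv4_bytes = std::array<std::uint8_t, 4>;

// True when every byte is zero, i.e. the address is 0.0.0.0.
inline bool is_all_zeros(const ipv4_bytes& addr) {
    return std::all_of(addr.begin(), addr.end(),
                       [](std::uint8_t b) { return b == 0; });
}

// Keeps only the targets that still have a usable address. Updates aimed at
// a node marked as "left" (address 0.0.0.0) are dropped instead of being
// handed on for writing or hinting.
inline std::vector<ipv4_bytes> drop_left_node_targets(const std::vector<ipv4_bytes>& targets) {
    std::vector<ipv4_bytes> kept;
    std::copy_if(targets.begin(), targets.end(), std::back_inserter(kept),
                 [](const ipv4_bytes& a) { return !is_all_zeros(a); });
    assert(kept.size() <= targets.size());
    return kept;
}
```

As the commit message notes, dropping the update trades view consistency for avoiding the crash; the sketch only shows the filtering step.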
Wojciech Mitros	813fef44d3	exceptions: make view update timeouts inherit from timed_out_error Currently, when generating and propagating view updates, if we notice that we've already exceeded the time limit, we throw an exception inheriting from `request_timeout_exception`, to later catch and log it when finishing request handling. However, when catching, we only check timeouts by matching the `timed_out_error` exception, so the exception thrown in the view update code is not registered as a timeout exception, but an unknown one. This can cause tests which were based on the log output to start failing, as in the past we were noticing the timeout at the end of the request handling and using the `timed_out_error` to keep processing it and now, even though we do notice the timeout even earlier, due to its type we log an error to the log, instead of treating it as a regular timeout. In this patch we make the error thrown on timeout during view updates inherit from `timed_out_error` instead of the `request_timeout_exception` (it is also moved from the "exceptions" directory, where we define exceptions returned to the user). Aside from helping with the issue described above, we also improve our metrics, as the `request_timeout_exception` is also not checked for in the `is_timeout_exception` method, and because we're using it to check whether we should update write timeout metrics, they will only start getting updated after this patch. Fixes #19261 (cherry picked from commit `4aa7ada771`) Closes scylladb/scylladb#19262	2024-06-13 12:01:12 +03:00
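The effect of the base class on how the error is classified can be shown with a small self-contained sketch. The types below are simplified stand-ins that only borrow the names `timed_out_error` and `request_timeout_exception` from the commit message; `view_update_timeout_before`, `view_update_timeout_after` and `classify` are hypothetical.

```c++
#include <cassert>
#include <stdexcept>

// Simplified stand-ins for the two unrelated exception families.
struct timed_out_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct request_timeout_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Before the patch: the view update timeout derives from request_timeout_exception.
struct view_update_timeout_before : request_timeout_exception {
    using request_timeout_exception::request_timeout_exception;
};
// After the patch: it derives from timed_out_error.
struct view_update_timeout_after : timed_out_error {
    using timed_out_error::timed_out_error;
};

enum class outcome { timeout, unknown_error };

// Mimics a handler that recognises timeouts only by matching timed_out_error.
template <typename Ex>
outcome classify() {
    try {
        throw Ex("view update timed out");
    } catch (const timed_out_error&) {
        return outcome::timeout;
    } catch (const std::exception&) {
        return outcome::unknown_error;
    }
}

inline void demo_classification() {
    assert(classify<view_update_timeout_before>() == outcome::unknown_error);
    assert(classify<view_update_timeout_after>() == outcome::timeout);
}
```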
Wojciech Mitros	3c47ab9851	mv: handle different ERMs for base and view table When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice-versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to `f2ff701`, the update may have just been performed partially when we are taking the ERMs. Until now, we assumed that the ERMs are synchronized when finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to not be the same, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash. This is fixed in this patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1. Fixes: #17786 Fixes: #18709 (cherry-picked from `519317dc58`) This commit also includes the follow-up patch that removes the flakiness from the test that is introduced by the commit above. The flakiness was caused by enabling the delay_before_get_view_natural_endpoint injection on a node and not disabling it before the node is shut down. The patch removes the enabling of the injection on the node in the first place. By squashing the commits, we won't introduce a place in the commit history where a potential bisect could mistakenly fail. Fixes: https://github.com/scylladb/scylladb/issues/18941 (cherry-picked from `0de3a5f3ff`) Closes scylladb/scylladb#18974	2024-05-30 09:13:31 +02:00
Pavel Emelyanov	b24fb8dc87	inet_address: Remove to_sstring() in favor of fmt::to_string The existing inet_address::to_string() calls fmt::format("{}", *this) anyway. However, the to_string() method is defined in the .cc file, while the fmt formatter is in the header and is equipped with constexprs so that converting an address to string is done at compile time as much as possible. Also, though minor, fmt::to_string(foo) is believed to be even faster than fmt::format("{}", foo). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18712	2024-05-21 09:43:08 +03:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. 
The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
Wojciech Mitros	485eb7a64c	test: add test for bad_allocs during large mv queries This patch adds a test for reproducing issue #12379, which is being fixed in #16819. The test case works by creating a table with a materialized view, and then performing a partition delete query on it. At the same time, it uses injections to limit the memory to a level lower than usual, in order to increase the consistency of the test, and to limit its runtime. Before #16819, the test would exceed the limit and fail, and now the next allocation is throttled using a sleep.	2024-05-13 18:16:39 +02:00
Jan Ciolek	e0442d7bfa	mv: throttle view update generation for large queries For every mutation applied to the base table we have to generate the corresponding materialized view table updates. In case of simple requests, like INSERT or UPDATE, the number of view updates generated per base table mutation is limited to at most a few view table updates per base table update. The situation is different for DELETE queries, which can delete the whole partitions or clustering ranges. Range deletions are fast on the base table, but for the view table the situation is different. Deleting a single partition in the base table will generate as many singular view updates as there are rows in the deleted partition, which could potentially be in the millions. To prevent OOM view updates are generated in batches of at most 100 rows. There is a loop which generates the next batch of updates, spawns tasks to send them to remote nodes, generates another batch and so on. The problem is that there is no concurrency control - each batch is scheduled to be sent in the background, but the following batch is generated without waiting for the previously generated updates to be sent. This can lead to unbounded concurrency and OOM. To protect against this view update generation should be limited somehow. There is an existing mechanism for limiting view updates - throttling. We keep track of how many pending view updates there are, in the view backlog, and delay responses to the client based on this backlog's fullness. For a well behaved client with limited concurrency this will slow down the amount of incoming requests until it reaches an optimal point. This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything for range DELETEs. A DELETE is a single request that generates millions of view updates, delaying client response doesn't help. 
The throttling mechanism could be extended to cover this case - we could treat the DELETE request like any other client and force it to wait before sending more updates. This commit implements this approach - before sending the next batch of updates the generator is forced to sleep for a bit of time, calculated using the existing throttling equation. The fuller the backlog gets, the longer the generator will have to sleep, and hopefully this will prevent overloading the system with view updates. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-13 18:16:23 +02:00
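A rough sketch of the idea in the commit above: the generator sleeps before each batch of at most 100 rows, and the sleep grows with the fullness of the view update backlog. The log does not give the actual throttling equation, so the cubic curve below is purely an assumption for illustration, as are the names `backlog`, `throttle_delay` and `total_sleep_for_delete`. The sketch also assumes the backlog stays constant for the whole request, which a real system would not.

```c++
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical snapshot of the view update backlog.
struct backlog {
    std::size_t current;  // units currently consumed
    std::size_t max;      // total units available
};

// Assumed delay curve (not taken from the source): the delay grows with the
// cube of the backlog fullness and is capped at `budget`.
inline std::chrono::microseconds throttle_delay(backlog b, std::chrono::microseconds budget) {
    if (b.max == 0) {
        return std::chrono::microseconds{0};
    }
    const double fullness =
        std::min(1.0, static_cast<double>(b.current) / static_cast<double>(b.max));
    const double scaled = static_cast<double>(budget.count()) * fullness * fullness * fullness;
    return std::chrono::microseconds{static_cast<std::chrono::microseconds::rep>(scaled)};
}

// Total time a generator would sleep while emitting `rows` view updates in
// batches of at most 100 rows, sleeping once before each batch.
inline std::chrono::microseconds total_sleep_for_delete(std::size_t rows, backlog b,
                                                        std::chrono::microseconds budget) {
    constexpr std::size_t max_rows_per_batch = 100;
    const std::size_t batches = (rows + max_rows_per_batch - 1) / max_rows_per_batch;
    return throttle_delay(b, budget) * static_cast<std::chrono::microseconds::rep>(batches);
}

inline void demo_throttle() {
    const std::chrono::microseconds budget{1000};
    assert(throttle_delay(backlog{0, 100}, budget).count() == 0);
    assert(throttle_delay(backlog{100, 100}, budget) == budget);
}
```

With an empty backlog the generator does not sleep at all, so small requests are unaffected; only a filling backlog slows down a large range DELETE.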
Jan Ciolek	ae28b8bdb7	db/view: extract view throttling delay calculation to a global function In order to prevent overload caused by too many view updates, their number is limited by delaying client responses. The amount of time to delay for is calculated based on the fullness of the view update backlog. Currently this is done in the function calculate_delay, used by abstract_write_response_handler. In the following commits I will introduce another throttling mechanism that uses the same equation to calculate wait time, so it would be good to reuse the existing function. Let's make the function globally accessible. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-13 18:14:56 +02:00
Jan Ciolek	59b7920b0b	view_update_generator: add get_storage_proxy() During view generation we would like to be able to access information about the current state of view update backlogs, but this information is kept inside storage_proxy. A reference to storage_proxy is kept inside view_update_generator, so the easiest way to get access to it from the view update code is by adding a public getter there. There's already a similar getter for replica::database: get_db(), so it's in line with the rest of the code. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-02 10:59:55 +02:00
Pavel Emelyanov	67736b5cd3	Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB" This reverts commit `9c2a836607`.	2024-05-02 08:16:14 +03:00
Pavel Emelyanov	d47053266b	view: Abort pending view updates when draining When view builder is drained (it now happens very early, but next patch moves this into regular drain) it waits for all on-going view build steps to complete. This includes waiting for any outstanding proxy view writes to complete as well. View writes in proxy have a very high timeout of 5 minutes but they are cancellable. However, cancelling of such writes happens in proxy's drain_on_shutdown() call which, in turn, happens pretty late on shutdown. Effectively, by the time it happens all view writes must have completed already, so stop-time cancelling doesn't really work nowadays. Next patch makes view builder drain happen a bit later during shutdown, namely -- _after_ shutting down messaging service. When it happens that late, non-working view writes cancellation becomes critical, as view builder drain hangs for the aforementioned 5 minutes. This patch explicitly cancels all view writes when view builder stops. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-02 08:16:12 +03:00
Botond Dénes	65a385f5d0	Merge 'Relax the way view builder code checks if a table exists' from Pavel Emelyanov There are two places that workaround db.column_family_exists() call with some fancy exceptions-catching lambda. This PR makes things simpler. Closes scylladb/scylladb#18441 * github.com:scylladb/scylladb: view: Open-code one line lambda checking if table exists view: Use non-throwoing check if a table exists	2024-05-01 10:14:58 +03:00
Pavel Emelyanov	7f2742893e	view: Open-code one line lambda checking if table exists Continuation of the previous patch. The lambda in question used to be a heavyweight(y) code, but now it's one-liner. And it's only called once, so no more point in keeping it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-26 20:19:38 +03:00
Pavel Emelyanov	a3e76f9c93	view: Use non-throwoing check if a table exists Two places in view code check if a table exists by finding its schema ID and catching no_such_column_family exception. That's a bit heavyweight, database has column_family_exists() method for such cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-26 20:17:35 +03:00
Botond Dénes	044fd7a3ec	Merge 'Move some view updating methods from table to view_update_generator' from Pavel Emelyanov The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from table itself, but rather depend on some internals of the v.u.generator itself. Moving them there allows removing the view concurrency semaphore from keyspace and table, thus reducing the cross-components dependencies. Closes scylladb/scylladb#18421 * github.com:scylladb/scylladb: replica: Do not carry view concurrency semaphore pointer around view: Get concurrency semaphore via database, not table view_update_generator: Mark mutate_MV() private view: Move view_update_generator methods' code view: Move table::generate_and_propagate_view_updates into view code view: Move table::populate_views() into view_update_generator class	2024-04-26 10:55:38 +03:00
Pavel Emelyanov	4ac30e5337	view-builder: Print correct exception in built ste exception handler Inside .handle_exception() continuation std::current_exception() doesn't work, there's std::exception ex argument to handler's lambda instead fixes #18423 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18349	2024-04-26 09:58:45 +03:00
Pavel Emelyanov	2ee7c41139	view: Get concurrency semaphore via database, not table The _view_update_concurrency_sem field on database propagates itself via keyspace config down to table config and view_update_generator then grabs one via table:: helper. That's an overkill, view_update_generator has a direct reference to the database and can get this semaphore from there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:25:57 +03:00
Pavel Emelyanov	3d8b572d96	view_update_generator: Mark mutate_MV() private Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:25:40 +03:00
Pavel Emelyanov	bc4552740f	view: Move view_update_generator methods' code Now when the two methods belong to another class, move the code itself to db/view , where the class itself resides. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:24:20 +03:00
Pavel Emelyanov	c2bf6b43b2	view: Move table::generate_and_propagate_view_updates into view code Similarly to populate_views() method, this one also naturally belongs to view_update_generator class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:20:06 +03:00
Pavel Emelyanov	670c7c925c	view: Move table::populate_views() into view_update_generator class The method in question has little to do with table, effectively it only needs stats and concurrency semaphore. And the semaphore in question is obtained from table indirectly, it really resides on database. On the other hand, the method carries lots of bits from db::view, e.g. the view_update_builder class, memory_usage_of() helper and a bit more. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:17:20 +03:00
Pavel Emelyanov	1b1b86809d	view-builder: Coroutinize stop() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-23 20:43:42 +03:00
Pavel Emelyanov	eaf78fca04	view_builder: Do not try to handle step join exceptions on stop Commit `23c891923e` (main: make sure view_builder doesn't propagate semaphore errors) ignored some exceptions that could pop up from the _build_step/do_build_step() serialized action, since they are "benign" on stop. Later there came `b56b10a4bb` (view_builder: do_build_step: handle unexpected exceptions) that plugged any exception from the action in question, regardless of whether they happen on stop or at run-time. Apparently, the latter commit supersedes the former. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-23 20:26:14 +03:00
Kefu Chai	a439ebcfce	treewide: include fmt/ranges.h and/or fmt/std.h before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting the container types, like vector, map optional and variant using {fmt} instead of the homebrew formatter based on operator<<. with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop `FMT_DEPRECATED_OSTREAM` macro. Refs scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-04-19 22:56:16 +08:00
Kamil Braun	9c2a836607	Revert "Merge 'Drain view_builder in generic drain' from ScyllaDB" This reverts commit `298a7fcbf2`, reversing changes made to `5cf53e670d`. The change made CI flaky. Fixes: scylladb/scylladb#18278	2024-04-18 11:50:41 +02:00
Pavel Emelyanov	90593f4e82	view_builder: Generalize mark_as_built(view_ptr) method Marking is performed in two places and they can be generalized Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-05 19:56:12 +03:00
Pavel Emelyanov	3c3f2cd337	view_builder: Move mark_existing_views_as_built from storage service Now it's in the correct component Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-05 19:56:11 +03:00
Wojciech Mitros	9789a3dc7c	mv: keep semaphore units alive until the end of a remote view update When a view update has both a local and remote target endpoint, it extends the lifetime of its memory tracking semaphore units only until the end of the local update, while the resources are actually used until the remote update finishes. This patch changes the semaphore transferring so that in case of both local and remote endpoints, both view updates share the units, causing them to be released only after the update that takes longer finishes. Fixes #17890 Closes scylladb/scylladb#17891	2024-03-25 19:43:58 +02:00
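The lifetime change in the commit above can be sketched with shared ownership: when both the local and the remote update hold the same units, the units return to the budget only when the slower of the two finishes. This is a simplified model using `std::shared_ptr`; `memory_budget`, `units` and `pending_update` are hypothetical stand-ins, not Seastar's semaphore types.

```c++
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a memory-tracking semaphore.
struct memory_budget {
    std::size_t available;
};

// RAII holder: takes `n` units on construction, returns them on destruction.
class units {
    memory_budget& _budget;
    std::size_t _n;
public:
    units(memory_budget& budget, std::size_t n) : _budget(budget), _n(n) {
        assert(budget.available >= n);
        budget.available -= n;
    }
    ~units() { _budget.available += _n; }
    units(const units&) = delete;
    units& operator=(const units&) = delete;
};

// An in-flight view update that keeps its share of the units alive.
struct pending_update {
    std::shared_ptr<units> held;
};

inline void demo_shared_units() {
    memory_budget budget{100};
    auto shared = std::make_shared<units>(budget, 40);
    pending_update local{shared};
    pending_update remote{shared};
    shared.reset();                    // only the two updates own the units now
    assert(budget.available == 60);
    local.held.reset();                // the local update finishes first
    assert(budget.available == 60);    // units are still held by the remote update
    remote.held.reset();               // the remote update finishes
    assert(budget.available == 100);   // units are released only now
}
```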
Avi Kivity	72bbe75d5b	Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes https://github.com/scylladb/scylladb/issues/17025 Nodes in the left state will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those nodes' location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacenter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. Currently left nodes are never removed from topology, so will accumulate in memory. 
We could garbage-collect them from topology coordinator if a left node is absent in any replica set. That means we need a new state - left_for_real. Closes scylladb/scylladb#17388 * github.com:scylladb/scylladb: test: py: Add test for view replica pairing after replace raft, api: Add RESTful API to query current leader of a raft group test: test_tablets_removenode: Verify replacing when there is no spare node doc: topology-on-raft: Document replace behavior with tablets tablets, raft topology: Rebuild tablets after replacing node is normal tablets: load_balancer: Access node attributes via node struct tablets: load_balancer: Extract ensure_node() mv: Switch to using host_id-based replica set effective_replication_map: Introduce host_id-based get_replicas() raft topology: Keep nodes in the left state to topology tablets: Introduce read_required_hosts()	2024-03-18 16:16:08 +02:00
Wojciech Mitros	efcb718e0a	mv: adjust memory tracking of single view updates within a batch Currently, when dividing memory tracked for a batch of updates we do not take into account the overhead that we have for processing every update. This patch adds the overhead for single updates and joins the memory calculation path for batches and their parts so that both use the same overhead. Fixes #17854 Closes scylladb/scylladb#17855	2024-03-18 14:31:54 +02:00
Tomasz Grabiec	9b656ec2aa	mv: Switch to using host_id-based replica set This is necessary to not break replica pairing between base and view. After replacing a node, tablet replica set contains for a while the replaced node which is in the left state. This node is not returned by the IP-based get_natural_endpoints() so the replica indexes would shift, changing the pairing with the view. The host_id-based replica set always has stable indexes for replicas.	2024-03-15 11:05:29 +01:00
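The index shift described in the commit above can be shown with a simplified model in which a base replica is paired with the view replica at the same position in the replica set. The real pairing logic also filters by datacenter and is more involved; `replica`, `ip_based_set`, `host_id_based_set` and `paired_view_replica` are hypothetical names used only for this sketch.

```c++
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical replica descriptor: a host ID plus whether the node has left.
struct replica {
    std::string host_id;
    bool left;
};

// IP-based view of the replica set: nodes in the left state have no IP and are omitted.
inline std::vector<std::string> ip_based_set(const std::vector<replica>& rs) {
    std::vector<std::string> out;
    for (const auto& r : rs) {
        if (!r.left) {
            out.push_back(r.host_id);
        }
    }
    return out;
}

// host_id-based view: every replica is present, so positions stay stable.
inline std::vector<std::string> host_id_based_set(const std::vector<replica>& rs) {
    std::vector<std::string> out;
    for (const auto& r : rs) {
        out.push_back(r.host_id);
    }
    return out;
}

// Pairs base replica `me` with the view replica at the same index.
inline std::optional<std::string> paired_view_replica(const std::vector<std::string>& base,
                                                      const std::vector<std::string>& view,
                                                      const std::string& me) {
    for (std::size_t i = 0; i < base.size() && i < view.size(); ++i) {
        if (base[i] == me) {
            return view[i];
        }
    }
    return std::nullopt;
}

inline void demo_pairing() {
    // Base tablet replicas: A was replaced and is in the left state.
    const std::vector<replica> base{{"A", true}, {"B", false}, {"C", false}};
    const std::vector<std::string> view{"X", "Y", "Z"};
    // Stable pairing: B is at index 1 and stays paired with Y.
    assert(paired_view_replica(host_id_based_set(base), view, "B") == std::optional<std::string>{"Y"});
    // IP-based set drops A, so B shifts to index 0 and is paired with X instead.
    assert(paired_view_replica(ip_based_set(base), view, "B") == std::optional<std::string>{"X"});
}
```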
Pavel Emelyanov	3a734facc7	view_builder: Complete build step early if reader produces nothing Builder works in "steps". Each step runs for a given base table, when a new view is created it either initiates a step or appends to currently running step. Running a step means reading mutations from local sstables reader and applying them to all views that have jumped into this step so far. When a view is added to the step it remembers the current token value the step is on. When step receives end-of-stream it rewinds to minimal-token. Rewinding is done by closing current reader and creating a new one. Each time token is advanced, all the views that meet the new token value for the second time (i.e. have scanned a full round) are marked as built and are removed from step. When no views are left on step, it finishes. The above machinery can break when rewinding the end-of-stream reader. The trick is that a running step silently assumes that if the reader once produced some token (and there can be a view that remembered this token as its starting one), then after rewinding the reader would generate the same token or greater. With tablets, however, that's not the case. When a node is decommissioned tablets are cleaned and all sstables are removed. Rewinding a reader after that yields an empty reader that produces no tokens from then on. Consequently, any build steps that had captured tokens prior to cleanup would get stuck forever. The fix is to check if the mutation consumer stepped at least one step forward after rewind, and if not -- complete all the attached views. fixes: #17293 Similar thing should happen if the base table is truncated with views being built from it. Testing it steps on compaction assertion elsewhere and needs more research. refs: #17543 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17548	2024-03-12 14:58:47 +02:00
Nadav Har'El	19bcea6216	materialized views: fix rare failure caused by empty update This one-line patch fixes a failure in the dtest lwt_schema_modification_test.py::TestLWTSchemaModification ::test_table_alter_delete Where an update sometimes failed due to an internal server error, and the log had the mysterious warning message: "std::logic_error (Empty materialized view updated)" We've also seen this log-message in the past in another user's log, and never understood what it meant. It turns out that the error message was generated (and warning printed) while building view updates for a base-table mutation, and noticing that the base mutation contains an empty row - a row with no cells or tombstone or anything whatsoever. This case was deemed (8 years ago, in `d5a61a8c48`) unexpected and nonsensical, and we threw an exception. But this case actually can happen - here is how it happened in test_table_alter_delete - which is a test involving a strange combination of materialized views, LWT and schema changes: 1. A table has a materialized view, and also a regular column "int_col". 2. A background thread repeatedly drops and re-creates this column int_col. 3. Another thread deletes rows with LWT ("IF EXISTS"). 4. These LWT operations each reads the existing row, and because of repeated drop-and-recreate of the "int_col" column, sometimes this read notices that one node has a value for int_col and the other doesn't, and creates a read-repair mutation setting int_col (the difference between the two reads includes just in this column). 5. The node missing "int_col" receives this mutation which sets only int_col. It upgrade()s this mutation to its most recent schema, which doesn't have int_col, so it removes this column from the mutation row - and is left with a completely empty mutation row. This completely empty row is not useful, but upgrade() doesn't remove it. 6. 
The view-update generation code sees this empty base-mutation row and fails it with this std::logic_error. 7. The node which sent the read-repair mutation sees that the read repair failed, so it fails the read and therefore fails the LWT delete operation. It is this LWT operation which failed in the test, and caused the whole test to fail. The fix is trivial: an empty base-table row mutation should simply be ignored when generating view updates - it shouldn't cause any error. Before this patch, test_table_alter_delete used to fail in roughly 20% of the runs on my laptop. After this patch, I ran it 100 times without a single failure. Fixes #15228 Fixes #17549 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17607	2024-03-07 12:00:43 +02:00
Avi Kivity	51df8b9173	interval: rename nonwrapping_interval to interval Our interval template started life as `range`, and supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval designation. We just rename nonwrapping_interval to `interval` and remove the type alias.	2024-02-21 19:43:17 +02:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Botond Dénes	7f17d3bb0e	replica/database: keyspace: add uses_tablets() Mirroring table::uses_tablets(), provides a convenient and -- more importantly -- easily discoverable way to determine whether the keyspace uses tablets or not. This information is of course already available via the abstract replication strategy, but as seen in a few examples, this is not easily discoverable and sometimes people resorted to enumerating the keyspace's tables to be able to invoke table::uses_tablets().	2024-02-15 01:51:26 -05:00
Nadav Har'El	21e7deafeb	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. 
The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172	2024-02-12 13:17:29 +02:00
Nadav Har'El	14315fcbc3	mv: fix missing view deletions in some cases of range tombstones For efficiency, if a base-table update generates many view updates that go to the same partition, they are collected as one mutation. If this mutation grows too big it can lead to memory exhaustion, so since commit `7d214800d0` we split the output mutation into mutations no longer than 100 rows (max_rows_for_view_updates) each. This patch fixes a bug where this split was done incorrectly when the update involved range tombstones, a bug which was discovered by a user in a real use case (#17117). Range tombstones are read in two parts, a beginning and an end, and the code could split the processing between these two parts, with the result that some of the range tombstones in the update could be missed - and the view could miss some deletions that happened in the base table. This patch fixes the code in two places to avoid breaking up the processing between range tombstones: 1. The counter "_op_count" that decides where to break the output mutation should only be incremented when adding rows to this output mutation. The existing code strangely incremented it on every read (!?) which resulted in the counter being incremented on every input fragment, and in particular could reach the limit 100 between two range tombstone pieces. 2. Moreover, the length of output was checked in the wrong place... The existing code could get to 100 rows, not check at that point, read the next input - half a range tombstone - and only then check that we reached 100 rows and stop. The fix is to calculate the number of rows in the right place - exactly when it's needed, not before the step. The first change needs more justification: The old code, that incremented _op_count on every input fragment and not just output fragments did not fit the stated goal of its introduction - to avoid large allocations. In one test it resulted in breaking up the output mutation into chunks of 25 rows instead of the intended 100 rows. 
But, maybe there was another goal, to stop the iteration after 100 input rows and avoid the possibility of stalls if there are no output rows? It turns out the answer is no - we don't need this _op_count increment to avoid stalls: The function build_some() uses `co_await on_results()` to run one step of processing one input fragment - and `co_await` always checks for preemption. I verified that indeed no stalls happen by using the existing test test_long_skipped_view_update_delete_with_timestamp. It generates a very long base update where all the view updates go to the same partition, but all but the last few updates don't generate any view updates. I confirmed that the fixed code loops over all these input rows without increasing _op_count and without generating any view update yet, but it does NOT stall. This patch also includes two tests reproducing this bug and confirming it's fixed, and also two additional tests for breaking up long deletions that I wanted to make sure don't fail after this patch (they don't). By the way, this fix would have also fixed issue #12297 - which we fixed a year ago in a different way. That issue happened when the code went through 100 input rows without generating any output rows, and incorrectly concluded that there's no view update to send. With this fix, the code no longer stops generating the view update just because it saw 100 input rows - it would have waited until it generated 100 output rows in the view update (or the input is really done). Fixes #17117 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17164	2024-02-06 14:57:33 +02:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Nadav Har'El	1bcaeb89c7	view: revert cleanup filter that doesn't work with tablets This patch reverts commit `10f8f13b90` from November 2022. That commit added to the "view update generator", the code which builds view updates for staging sstables, a filter that ignores ranges that do not belong to this node. However, 1. I believe this filter was never necessary, because the view update code already silently ignores base updates which do not belong to this replica (see get_view_natural_endpoint()). After all, the view update needs to know that this replica is the Nth owner of the base update to send its update to the Nth view replica, but if no such N exists, no view update is sent. 2. The code introduced for that filter used a per-keyspace replication map, which was ok for vnodes but no longer works for tablets, and causes the operation using it to fail. 3. The filter was used every time the "view update generator" was used, regardless of whether any cleanup is necessary or not, so every such operation would fail with tablets. So for example the dtest test_mvs_populating_from_existing_data fails with tablets: * This test has view building in parallel with automatic tablet movement. * Tablet movement is streaming. * When streaming happens before view building has finished, the streamed sstables get "view update generator" run on them. This causes the problematic code to be called. Before this patch, the dtest test_mvs_populating_from_existing_data fails when tablets are enabled. After this patch, it passes. Fixes #16598 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-01-14 13:24:44 +02:00
Nadav Har'El	0fe40f729e	mv: sleep a bit before view-update-generator restart The "view update generator" is responsible for generating view updates for staging sstables (such as coming from repair). If the processing fails, the code retries - immediately. If there is some persistent bug, such as issue #16598, we will have a tight loop of error messages, potentially a gigabyte of identical messages every second. In this patch we simply add a sleep of one second after view update generation fails before retrying. We can still get many identical error messages if there is some bug, but not more than one per second. Refs #16598. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-01-14 13:13:52 +02:00
Kefu Chai	be364d30fd	db: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16664	2024-01-09 11:44:19 +02:00
Avi Kivity	6394854f04	Merge 'Some cleanups in tests for tablets + MV ' from Nadav Har'El This small series improves two things in the multi-node tests for tablet support in materialized views: 1. The test for Alternator LSI, which "sometimes" could reproduce the bug by creating a 10-node cluster with a random tablet distribution, is replaced by a reliable 2-node cluster which controls the tablet distribution. The new test also confirms that tablets are actually enabled in Alternator (reviewers of the original test noted it would be easy to pass the test if tablets were accidentally not enabled... :-)). 2. Simplify the tablet lookup code in the test to not go through a "table id", and look up the table's (or view's) name directly (requires a full-table scan of the tablets table, but that's entirely reasonable in a test). The third patch in this series also fixes a comment typo discovered in a previous review. Closes scylladb/scylladb#16440 * github.com:scylladb/scylladb: materialized views: fix typo in comment test_mv_tablets: simplify lookup of tablets alternator, tablets: improve Alternator LSI tablets test	2023-12-27 20:18:14 +02:00
Benny Halevy	060b16f987	view: apply_to_remote_endpoints: fix use-after-free `b815aa021c` added a yield before the trace point, causing the moved `frozen_mutation_and_schema` (and `inet_address_vector_topology_change`) to drop out of scope and be destroyed, as the rvalue-referenced objects aren't moved onto the coroutine frame. This change passes them by value rather than by rvalue-reference so they will be stored in the coroutine frame. Fixes #16540 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#16541	2023-12-24 21:43:48 +02:00
Nadav Har'El	6640278aa7	materialized views: fix typo in comment Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-24 10:12:44 +02:00
Nadav Har'El	b815aa021c	mv, test: fix delay_before_remote_view_update injection point The "delay_before_remote_view_update" is a recently-added injection point which should add a delay before remote view updates, but NOT force the writer to wait for it (whether the writer waits for it or not depends on whether the view is configured as synchronous or not). Unfortunately, the delay was added at the WRONG place, which caused it to sometimes be done even on asynchronous views, breaking (with false-negative) the tests that need this delay to reproduce bugs of missing synchronous updates (Refs #16371). The fix here is even simpler than the (wrong) old code - we just add the sleep to the existing function apply_to_remote_endpoints() instead of making the caller even more complex. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-21 11:44:50 +02:00
Nadav Har'El	37b5c03865	mv: coroutinize wait code for remote view updates In the previous patch we added a delay injection point (for testing) in the view update code. Because the code was using continuation style, this resulted in increased indentation and ugly repetition of captures. So in this patch we coroutinize the code that waits for remote view updates, making it simpler, shorter, and less indented. Note that this function still uses continuations in one place: The remote view update is still composed of two steps that need to happen one after another, but we don't necessarily need to wait for them to happen. This is easiest to do with chaining continuations, and then either waiting or not waiting for the resulting future. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-17 20:15:08 +02:00
Nadav Har'El	bf6848d277	mv, test: add injection point to delay remote view update It's difficult to write a test (as we plan to do in the next patch) that verifies that synchronous view updates are indeed synchronous, i.e., that a write with CL=QUORUM on the base table returns only after CL=QUORUM was also achieved in the view table. The difficulty is that in a fast test machine, even if the synchronous-view-update is completely buggy, it's likely that by the time the test reads from the view, all view updates will have been completed anyway. So in this patch we introduce an injection point, for testing, named "delay_before_remote_view_update", which adds a delay before the base replica sends its update to the remote view replica (in case the view replica is indeed remote). As usual, this injection point isn't configurable - when enabled it adds a fixed (0.5 second) delay, on all view updates on all tables. The existing code used continuation-style Seastar programming, and the addition of the injection point in this patch made it even uglier, so in the next patch we will coroutine-ize this code. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-12-17 20:15:08 +02:00
Petr Gusev	7b55ccbd8e	token_metadata: drop the template Replace token_metadata2 ->token_metadata, make token_metadata back non-template. No behavior changes, just compilation fixes.	2023-12-12 23:19:54 +04:00
Petr Gusev	e50dbef3e2	database: get_token_metadata -> new token_metadata database::get_token_metadata() is switched to token_metadata2. get_all_ips method is added to the host_id-based token_metadata, since it's convenient and will be used in several places. It returns all current nodes converted to inet_address by means of the topology contained within token_metadata. hint_sender::can_send: if the node has already left the cluster we may not find its host_id. This case is handled in the same way as if it's not a normal token owner - we simply send a hint to all replicas.	2023-12-12 23:19:53 +04:00
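
The timestamp rule described in commit 21e7deafeb (two base-regular columns serving as GSI key columns) can be illustrated with a small sketch. This is not the ScyllaDB code: `timestamp_type`, `view_row_timestamp_old` and `view_row_timestamp` are hypothetical names chosen for illustration, and the only behavior taken from the commit message is "use the higher of the two timestamps" instead of always using the first column's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a cell's write timestamp.
using timestamp_type = std::int64_t;

// Behavior the commit message describes as wrong: always take the
// timestamp of key column 'a', assuming 'b' carries the same one.
inline timestamp_type view_row_timestamp_old(timestamp_type ts_a, timestamp_type ts_b) {
    (void)ts_b; // ignored, which is the bug
    return ts_a;
}

// Behavior the commit message describes as the fix: take the higher
// of the two key columns' timestamps.
inline timestamp_type view_row_timestamp(timestamp_type ts_a, timestamp_type ts_b) {
    return std::max(ts_a, ts_b);
}
```

With `a` written at time 10 and `b` rewritten at times 20 and 30, the old rule keeps producing timestamp 10 for every view update, so a view-row deletion at time 20 beats all later modifications. The fixed rule yields 20 and then 30, so the later modification wins.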

538 Commits
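
The counting change described in commit 14315fcbc3 (count output rows only, and check the limit before reading the next input) can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the ScyllaDB implementation: `split_view_updates` is a hypothetical function, each input fragment is reduced to a boolean saying whether it produces a view row, and range tombstones are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: input_produces_row[i] is true when the i-th input
// fragment adds a row to the output mutation. Returns the size of each
// output chunk, where a chunk holds at most max_rows rows.
inline std::vector<std::size_t> split_view_updates(
        const std::vector<bool>& input_produces_row, std::size_t max_rows) {
    std::vector<std::size_t> chunks;
    std::size_t rows_in_chunk = 0;
    for (bool produces_row : input_produces_row) {
        // Check the limit before consuming the next input fragment,
        // so a full chunk is never extended by one more read.
        if (rows_in_chunk == max_rows) {
            chunks.push_back(rows_in_chunk);
            rows_in_chunk = 0;
        }
        // Only output rows count toward the limit; inputs that produce
        // no view row leave the counter unchanged.
        if (produces_row) {
            ++rows_in_chunk;
        }
    }
    if (rows_in_chunk > 0) {
        chunks.push_back(rows_in_chunk);
    }
    return chunks;
}
```

In this model, 300 inputs that produce nothing followed by 3 that do yield a single chunk of 3 rows, rather than stopping early after 100 inputs with an empty result, which is the situation the commit message relates to issue #12297.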