scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 11:10:40 +00:00

Author	SHA1	Message	Date
Petr Gusev	a83c0a8bbc	test_secondary_index_collections: change insert/create index order Secondary index creation is asynchronous, meaning it takes time for existing data to be reflected within the index. However, new data added after the index is created should appear in it immediately. The test consisted of two parts. The first created a series of indexes for one table, added test data to the table, and then ran a series of checks. In the second part, several new indexes were added to the same table, and checks were made to make sure that already existing data would appear in them. This last part was flaky. The patch just moves the index creation statements from the second part to the first. Fixes: #14076 Closes #14090 (cherry picked from commit `0415ac3d5f`) Closes #15101	2023-08-24 14:09:08 +03:00
Petr Gusev	aca9e41a44	topology.cc: remove_endpoint: _dc_racks removal fix The eps reference was reused to manipulate the racks dictionary. This resulted in assigning a set of nodes from the racks dictionary to an element of the _dc_endpoints dictionary. This is a backport of `bcb1d7c` to branch-5.2. Refs: #14184 Closes #14893	2023-08-11 14:29:37 +03:00
Botond Dénes	bcb8f6a8dd	Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak If semaphore mismatch occurs, check whether both semaphores belong to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error. Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and perform it on all `lookup__querier` methods. The mismatch can happen if user's scheduling group changed during a query. We don't want to throw an error then, but drop and reset cached reader. This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it. Refers: https://github.com/scylladb/scylla-enterprise/issues/3182 Refers: https://github.com/scylladb/scylla-enterprise/issues/3050 Closes: #14770 Closes #14736 github.com:scylladb/scylladb: querier_cache: add stats of scheduling group mismatches querier_cache: check semaphore mismatch during querier lookup querier_cache: add reference to `replica::database::is_user_semaphore()` replica:database: add method to determine if semaphore is user one (cherry picked from commit `a8feb7428d`)	2023-08-09 10:20:53 +03:00
Nadav Har'El	e34c62c567	Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions. This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed. Fixes: #14819 Closes #14821 * github.com:scylladb/scylladb: test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations db/view/view_updating_consumer: account for the size of mutations mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() mutation/mutation: add memory_usage() (cherry picked from commit `056d04954c`)	2023-07-31 03:43:44 -04:00
Raphael S. Carvalho	d2369fc546	cached_file: Evict unused pages that aren't linked to LRU yet It was found that cached_file dtor can hit the following assert after OOM cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.` cached_file's dtor iterates through all entries and evict those that are linked to LRU, under the assumption that all unused entries were linked to LRU. That's partially correct. get_page_ptr() may fetch more than 1 page due to read ahead, but it will only call cached_page::share() on the first page, the one that will be consumed now. share() is responsible for automatically placing the page into LRU once refcount drops to zero. If the read is aborted midway, before cached_file has a chance to hit the 2nd page (read ahead) in cache, it will remain there with refcount 0 and unlinked to LRU, in hope that a subsequent read will bring it out of that state. Our main user of cached_file is per-sstable index caching. If the scenario above happens, and the sstable and its associated cached_file is destroyed, before the 2nd page is hit, cached_file will not be able to clear all the cache because some of the pages are unused and not linked. A page read ahead will be linked into LRU so it doesn't sit in memory indefinitely. Also allowing for cached_file dtor to clear all cache if some of those pages brought in advance aren't fetched later. A reproducer was added. Fixes #14814. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14818 (cherry picked from commit `050ce9ef1d`)	2023-07-28 13:56:28 +02:00
Kamil Braun	6273c4df35	test: use correct timestamp resolution in `test_group0_history_clearing_old_entries` In `10c1f1dc80` I fixed `make_group0_history_state_id_mutation` to use correct timestamp resolution (microseconds instead of milliseconds) which was supposed to fix the flakiness of `test_group0_history_clearing_old_entries`. Unfortunately, the test is still flaky, although now it's failing at a later step -- this is because I was sloppy and I didn't adjust this second part of the test to also use microsecond resolution. The test is counting the number of entries in the `system.group0_history` table that are older than a certain timestamp, but it's doing the counting using millisecond resolution, causing it to give results that are off by one sometimes. Fix it by using microseconds everywhere. Fixes #14653 Closes #14670 (cherry picked from commit `9d4b3c6036`)	2023-07-27 15:46:37 +02:00
Raphael S. Carvalho	986491447b	table: Optimize creation of reader excluding staging for view building View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s to INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s Refs #14089. Fixes #14244. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `1d8cb32a5d`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14764	2023-07-20 16:46:15 +03:00
Tomasz Grabiec	1a6f4389ae	Merge 'atomic_cell: compare value last' from Benny Halevy Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in `e225c88a65` ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb/scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes scylladb/scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit `87b4606cd6`) Closes #14649	2023-07-12 10:09:56 +03:00
Calle Wilund	1088c3e24a	storage_proxy: Make split_stats resilient to being called from different scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14636	2023-07-12 09:24:56 +03:00
Botond Dénes	c9cb8dcfd0	Merge '[backport 5.2] view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski View update routines accept mutation objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Backported from `c25201c1a3`. `view_build_test.cc` needed some tiny adjustments for the backport. Closes #14619 Fixes #14503 * github.com:scylladb/scylladb: test: view_build_test: add range tombstones to test_view_update_generator_buffering test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations view_updating_consumer: make buffer limit a variable view: fix range tombstone handling on flushes in view_updating_consumer	2023-07-11 15:04:23 +03:00
Piotr Dulikowski	57d0310dcc	combined: mergers: remove recursion in operator()() In mutation_reader_merger and clustering_order_reader_merger, the operator()() is responsible for producing mutation fragments that will be merged and pushed to the combined reader's buffer. Sometimes, it might have to advance existing readers, open new and / or close some existing ones, which requires calling a helper method and then calling operator()() recursively. In some unlucky circumstances, a stack overflow can occur: - Readers have to be opened incrementally, - Most or all readers must not produce any fragments and need to report end of stream without preemption, - There has to be enough readers opened within the lifetime of the combined reader (~500), - All of the above needs to happen within a single task quota. In order to prevent such a situation, the code of both reader merger classes were modified not to perform recursion at all. Most of the code of the operator()() was moved to maybe_produce_batch which does not recur if it is not possible for it to produce a fragment, instead it returns std::nullopt and operator()() calls this method in a loop via seastar::repeat_until_value. A regression test is added. Fixes: scylladb/scylladb#14415 Closes #14452 (cherry picked from commit `ee9bfb583c`) Closes #14605	2023-07-11 11:09:25 +03:00
Michał Chojnowski	78f25f2d36	test: view_build_test: add range tombstones to test_view_update_generator_buffering This patch adds a full-range tombstone to the compacted mutation. This raises the coverage of the test. In particular, it reproduces issue #14503, which should have been caught by this test, but wasn't.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	14fa3ee34e	test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations A random mutation test for view_updating_consumer's buffering logic. Reproduces #14503.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	75933b9906	view_updating_consumer: make buffer limit a variable The limit doesn't change at runtime, but we this patch makes it variable for unit testing purposes.	2023-07-11 09:44:00 +02:00
Michał Chojnowski	fc7b02c8e4	view: fix range tombstone handling on flushes in view_updating_consumer View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes #14503	2023-07-11 09:44:00 +02:00
Botond Dénes	8e63b2f3e3	Merge 'readers: evictable_reader: don't accidentally consume the entire partition' from Kamil Braun The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between. The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction. So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491. There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order. Fix the comparison and adjust one of the tests (added in #13563) to detect this case. After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s. Fixes #13491 Closes #14375 * github.com:scylladb/scylladb: readers: evictable_reader: don't accidentally consume the entire partition test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position (cherry picked from commit `586102b42e`)	2023-06-29 12:04:35 +03:00
Raphael S. Carvalho	58f88897c8	compaction: Fix incremental compaction for sstable cleanup After `c7826aa910`, sstable runs are cleaned up together. The procedure which executes cleanup was holding reference to all input sstables, such that it could later retry the same cleanup job on failure. Turns out it was not taking into account that incremental compaction will exhaust the input set incrementally. Therefore cleanup is affected by the 100% space overhead. To fix it, cleanup will now have the input set updated, by removing the sstables that were already cleaned up. On failure, cleanup will retry the same job with the remaining sstables that weren't exhausted by incremental compaction. New unit test reproduces the failure, and passes with the fix. Fixes #14035. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14038 (cherry picked from commit `23443e0574`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14193	2023-06-13 09:53:46 +03:00
Avi Kivity	f32971b81f	Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an entire page worth of work. Also add a unit test checking that the mismatch is detected in the first place and that readers are closed. Fixes: #13784 Closes #13790 * github.com:scylladb/scylladb: test/boost/database_test: add unit test for semaphore mismatch on range scans partition_slice_builder: add set_specific_ranges() multishard_mutation_query: make reader_context::lookup_readers() exception safe multishard_mutation_query: lookup_readers(): make inner lambda a coroutine (cherry picked from commit `1c0e8c25ca`)	2023-06-08 04:29:51 -04:00
Tomasz Grabiec	548a7f73d3	Merge 'range_tombstone_change_generator: fix an edge case in flush()' from Michał Chojnowski range_tombstone_change_generator::flush() mishandles the case when two range tombstones are adjacent and flush(pos, end_of_range=true) is called with pos equal to the end bound of the lesser-position range tombstone. In such case, the start change of the greater-position rtc will be accidentally emitted, and there won't be an end change, which breaks reader assumptions by ending the stream with an unclosed range tombstone, triggering an assertion. This is due to a non-strict inequality used in a place where strict inequality should be used. The modified line was intended to close range tombstones which end exactly on the flush position, but this is unnecessary because such range tombstones are handled by the last `if` in the function anyway. Instead, this line caused range tombstones beginning right after the flush position to be emitted sometimes. Fixes https://github.com/scylladb/scylladb/issues/12462 Closes #13894 * github.com:scylladb/scylladb: tests: row_cache: Add reproducer for reader producing missing closing range tombstone range_tombstone_change_generator: fix an edge case in flush()	2023-05-15 23:29:08 +02:00
Tomasz Grabiec	7c1bdc6553	tests: row_cache: Add reproducer for reader producing missing closing range tombstone Adds a reproducer for #12462. The bug manifests by reader throwing: std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}} The reason is that prior to the fix range_tombstone_change_generator::flush() was used with end_of_range=true to produce the closing range_tombstone_change and it did not handle correctly the case when there are two adjacent range tombstones and flush(pos, end_of_range=true) is called such that pos is the boundary between the two. Cherry-picked from `a717c803c7`.	2023-05-15 18:02:40 +02:00
Botond Dénes	f07a06d390	Merge 'service:forward_service: use long type instead of counter in function mocking' from Michał Jadwiszczak Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used. Fixes: #12939 Closes #12963 * github.com:scylladb/scylladb: test:boost: counter column parallelized aggregation test service:forward_service: use long type when column is counter (cherry picked from commit `61e67b865a`)	2023-05-07 14:27:29 +03:00
Botond Dénes	0e42defe06	readers: evictable_reader: skip progress guarantee when next pos is partition start The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the last buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between. The code guranteeing this forward change has a bug: when the next expected position is a partition-start (another partition), the code would loop forever, effectively reading all there is from the underlying reader. To avoid this, add a special case to ignore the progress guarantee loop altogether when the next expected position is a partition start. In this case, progress is garanteed anyway, because there is exactly one partition-start fragment in each partition. Fixes: #13491 Closes #13563 (cherry picked from commit `72003dc35c`)	2023-05-02 21:58:41 +03:00
Benny Halevy	6ad94fedf3	utils: clear_gently: do not clear null unique_ptr Otherwise the null pointer is dereferenced. Add a unit test reproducing the issue and testing this fix. Fixes #13636 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `12877ad026`)	2023-04-24 17:51:01 +03:00
Botond Dénes	7b2215d8e0	Merge 'Backport bugfixes regarding UDT, UDF, UDA interactions to branch-5.2' from Wojciech Mitros This patch backports https://github.com/scylladb/scylladb/pull/12710 to branch-5.2. To resolve the conflicts that it's causing, it also includes * https://github.com/scylladb/scylladb/pull/12680 * https://github.com/scylladb/scylladb/pull/12681 Closes #13542 * github.com:scylladb/scylladb: uda: change the UDF used in a UDA if it's replaced functions: add helper same_signature method uda: return aggregate functions as shared pointers udf: also check reducefunc to confirm that a UDF is not used in a UDA udf: fix dropping UDFs that share names with other UDFs used in UDAs pytest: add optional argument for new_function argument types udt: disallow dropping a user type used in a user function	2023-04-19 01:38:08 -04:00
Botond Dénes	c9a17c80f6	mutation/mutation_compactor: consume_partition_end(): reset _stop The purpose of `_stop` is to remember whether the consumption of the last partition was interrupted or it was consumed fully. In the former case, the compactor allows retreiving the compaction state for the given partition, so that its compaction can be resumed at a later point in time. Currently, `_stop` is set to `stop_iteration::yes` whenever the return value of any of the `consume()` methods is also `stop_iteration::yes`. Meaning, if the consuming of the partition is interrupted, this is remembered in `_stop`. However, a partition whose consumption was interrupted is not always continued later. Sometimes consumption of a partitions is interrputed because the partition is not interesting and the downstream consumer wants to stop it. In these cases the compactor should not return an engagned optional from `detach_state()`, because there is not state to detach, the state should be thrown away. This was incorrectly handled so far and is fixed in this patch, but overwriting `_stop` in `consume_partition_end()` with whatever the downstream consumer returns. Meaning if they want to skip the partition, then `_stop` is reset to `stop_partition::no` and `detach_state()` will return a disengaged optional as it should in this case. Fixes: #12629 Closes #13365 (cherry picked from commit `bae62f899d`)	2023-04-18 02:32:24 -04:00
Wojciech Mitros	51f19d1b8c	udt: disallow dropping a user type used in a user function Currently, nothing prevents us from dropping a user type used in a user function, even though doing so may make us unable to use the function correctly. This patch prevents this behavior by checking all function argument and return types when executing a drop type statement and preventing it from completing if the type is referenced by any of them. (cherry picked from commit `86c61828e6`)	2023-04-17 13:13:35 +02:00
Botond Dénes	3e10c3fc89	reader_concurrency_semaphore: don't evict inactive readers needlessly Inactive readers should only be evicted to free up resources for waiting readers. Evicting them when waiters are not admitted for any other reason than resources is wasteful and leads to extra load later on when these evicted readers have to be recreated end requeued. This patch changes the logic on both the registering path and the admission path to not evict inactive readers unless there are readers actually waiting on resources. A unit-test is also added, reproducing the overly-agressive eviction and checking that it doesn't happen anymore. Fixes: #11803 Closes #13286 (cherry picked from commit `bd57471e54`)	2023-04-14 10:37:30 +03:00
Pavel Emelyanov	487ba9f3e1	Merge '[backport] reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()' from Botond Dénes This PR backports `2f4a793457` to branch-5.2. Said patch depends on some other patches that are not part of any release yet. This PR should apply to 5.1 and 5.0 too. Closes #13162 * github.com:scylladb/scylladb: reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict() reader_permit: expose operator<<(reader_permit::state) reader_permit: add get_state() accessor	2023-03-16 18:41:08 +03:00
Botond Dénes	bd4f9e3615	Merge 'readers/nonforwarding: don't emit partition_end on next_partition,fast_forward_to' from Gusev Petr The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()` Fixes: #12249 Closes #12978 * github.com:scylladb/scylladb: flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE make_nonforwardable: test through run_mutation_source_tests make_nonforwardable: next_partition and fast_forward_to when single_partition is true make_forwardable: fix next_partition flat_mutation_reader_v2: drop forward_buffer_to nonforwardable reader: fix indentation nonforwardable reader: refactor, extract reset_partition nonforwardable reader: add more tests nonforwardable reader: no partition_end after fast_forward_to() nonforwardable reader: no partition_end after next_partition() nonforwardable reader: no partition_end for empty reader row_cache: pass partition_start though nonforwardable reader (cherry picked from commit `46efdfa1a1`)	2023-03-16 10:42:03 +02:00
Botond Dénes	c68deb2461	reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict() Instead of open-coding the same, in an incomplete way. clear_inactive_reads() does incomplete eviction in severeal ways: * it doesn't decrement _stats.inactive_reads * it doesn't set the permit to evicted state * it doesn't cancel the ttl timer (if any) * it doesn't call the eviction notifier on the permit (if there is one) The list goes on. We already have an evict() method that all this correctly, use that instead of the current badly open-coded alternative. This patch also enhances the existing test for clear_inactive_reads() and adds a new one specifically for `stop()` being called while having inactive reads. Fixes: #13048 Closes #13049 (cherry picked from commit `2f4a793457`)	2023-03-14 09:50:16 +02:00
Raphael S. Carvalho	22c1685b3d	sstables: Temporarily disable loading of first and last position metadata It's known that reading large cells in reverse cause large allocations. Source: https://github.com/scylladb/scylladb/issues/11642 The loading is preliminary work for splitting large partitions into fragments composing a run and then be able to later read such a run in an efficiency way using the position metadata. The splitting is not turned on yet, anywhere. Therefore, we can temporarily disable the loading, as a way to avoid regressions in stable versions. Large allocations can cause stalls due to foreground memory eviction kicking in. The default values for position metadata say that first and last position include all clustering rows, but they aren't used anywhere other than by sstable_run to determine if a run is disjoint at clustering level, but given that no splitting is done yet, it does not really matter. Unit tests relying on position metadata were adjusted to enable the loading, such that they can still pass. Fixes #11642. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12979 (cherry picked from commit `d73ffe7220`)	2023-02-27 08:58:34 +02:00
Avi Kivity	0b418fa7cf	cql3, transport, tests: remove "unset" from value type system The CQL binary protocol introduced "unset" values in version 4 of the protocol. Unset values can be bound to variables, which cause certain CQL fragments to be skipped. For example, the fragment `SET a = :var` will not change the value of `a` if `:var` is bound to an unset value. Unsets, however, are very limited in where they can appear. They can only appear at the top-level of an expression, and any computation done with them is invalid. For example, `SET list_column = [3, :var]` is invalid if `:var` is bound to unset. This causes the code to be littered with checks for unset, and there are plenty of tests dedicated to catching unsets. However, a simpler way is possible - prevent the infiltration of unsets at the point of entry (when evaluating a bind variable expression), and introduce guards to check for the few cases where unsets are allowed. This is what this long patch does. It performs the following: (general) 1. unset is removed from the possible values of cql3::raw_value and cql3::raw_value_view. (external->cql3) 2. query_options is fortified with a vector of booleans, unset_bind_variable_vector, where each boolean corresponds to a bind variable index and is true when it is unset. 3. To avoid churn, two compatiblity structs are introduced: cql3::raw_value{,_view}_vector_with_unset, which can be constructed from a std::vector<raw_value{,_view/}>, which is what most callers have. They can also be constructed with explicit unset vectors, for the few cases they are needed. (cql3->variables) 4. query_options::get_value_at() now throws if the requested bind variable is unset. This replaces all the throwing checks in expression evaluation and statement execution, which are removed. 5. A new query_options::is_unset() is added for the users that can tolerate unset; though it is not used directly. 6. A new cql3::unset_operation_guard class guards against unsets. It accepts an expression, and can be queried whether an unset is present. Two conditions are checked: the expression must be a singleton bind variable, and at runtime it must be bound to an unset value. 7. The modification_statement operations are split into two, via two new subclasses of cql3::operation. cql3::operation_no_unset_support ignores unsets completely. cql3::operation_skip_if_unset checks if an operand is unset (luckily all operations have at most one operand that tolerates unset) and applies unset_operation_guard to it. 8. The various sites that accept expressions or operations are modified to check for should_skip_operation(). This are the loops around operations in update_statement and delete_statement, and the checks for unset in attributes (LIMIT and PER PARTITION LIMIT) (tests) 9. Many unset tests are removed. It's now impossible to enter an unset value into the expression evaluation machinery (there's just no unset value), so it's impossible to test for it. 10. Other unset tests now have to be invoked via bind variables, since there's no way to create an unset cql3::expr::constant. 11. Many tests have their exception message match strings relaxed. Since unsets are now checked very early, we don't know the context where they happen. It would be possible to reintroduce it (by adding a format string parameter to cql3::unset_operation_guard), but it seems not to be worth the effort. Usage of unsets is rare, and it is explicit (at least with the Python driver, an unset cannot be introduced by ommission). I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't recognize unsets) with cql3::maybe_unset_value (that does), but that caused huge amounts of churn, so I abandoned that in favor of the current approach. Closes #12517	2023-01-16 21:10:56 +02:00
Kamil Braun	bed555d1e5	db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config' Make it clear that the table stores the snapshot configuration, which is not necessarily the currently operating configuration (the last one appended to the log). In the future we plan to have a separate virtual table for showing the currently operating configuration, perhaps we will call it `system.raft_config`.	2023-01-12 16:21:26 +01:00
Michał Radwański	dcab289656	boost/mvcc_test: use failure_injecting_allocation_strategy where it is meant to In test_apply_is_atomic, a basic form of exception testing is used. There is failure_injecting_allocation_strategy, which however is not used for any allocation, since for some reason, `with_allocator(r.allocator()` is used instead of `with_allocator(alloc`. Fix that. Closes #12354	2023-01-10 12:01:36 +01:00
Tomasz Grabiec	ebcd736343	cache: Fix undefined behavior when populating with non-full keys Regression introduced in `23e4c8315`. view_and_holder position_in_partiton::after_key() triggers undefined behavior when the key was not full because the holder is moved, which invalidates the view. Fixes #12367 Closes #12447	2023-01-10 12:51:54 +02:00
Raphael S. Carvalho	05ffb024bb	replica: Kill table::calculate_shard_from_sstable_generation() Inferring shard from generation is long gone. We still use it in some scripts, but that's no longer needed in Scylla, when loading the SSTables, and it also conflicts with ongoing work of UUID-based generations. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12476	2023-01-09 20:17:57 +02:00
Nadav Har'El	d6e6820f33	Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity The CQL binary protocol version 3 was introduced in 2014. All Scylla version support it, and Cassandra versions 2.1 and newer. Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer use 32-bit collection sizes. Unfortunately, we implemented support for multiple serialization formats very intrusively, by pushing the format everywhere. This avoids the need to re-serialize (sometimes) but is quite obnoxious. It's also likely to be broken, since it's almost untested and it's too easy to write cql_serialization_format::internal() instead of propagating the client specified value. Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's easy to verify that they are no longer in use on a running system by examining the `system.clients` table before upgrade. Fixes #10607 Closes #12432 * github.com:scylladb/scylladb: treewide: drop cql_serialization_format cql: modification_statement: drop protocol check for LWT transport: drop cql protocol versions 1 and 2	2023-01-09 18:52:41 +02:00
Tomasz Grabiec	f97268d8f2	row_cache: Fix violation of the "oldest version are evicted first" when evicting last dummy Consider the following MVCC state of a partition: v2: ==== <7> [entry2] ==== <9> ===== <last dummy> v1: ================================ <last dummy> [entry1] Where === means a continuous range and --- means a discontinuous range. After two LRU items are evicted (entry1 and entry2), we will end up with: v2: ---------------------- <9> ===== <last dummy> v1: ================================ <last dummy> [entry1] This will cause readers to incorrectly think there are no rows before entry <9>, because the range is continuous in v1, and continuity of a snapshot is a union of continuous intervals in all versions. The cursor will see the interval before <9> as continuous and the reader will produce no rows. This is only temporary, because current MVCC merging rules are such that the flag on the latest entry wins, so we'll end up with this once v1 is no longer needed: v2: ---------------------- <9> ===== <last dummy> ...and the reader will go to sstables to fetch the evicted rows before entry <9>, as expected. The bug is in rows_entry::on_evicted(), which treats the last dummy entry in a special way, and doesn't evict it, and doesn't clear the continuity by omission. The situation is not easy to trigger because it requires certain eviction pattern concurrent with multiple reads of the same partition in different versions, so across memtable flushes. Closes #12452	2023-01-09 16:10:52 +02:00
Avi Kivity	5ffe4fee6d	Merge 'Remove legacy half reverse' from Michał Radwański This commit removes consume_in_reverse::legacy_half_reverse, an option once used to indicate that the given key ranges are sorted descending, based on the clustering key of the start of the range, and that the range tombstones inside partition would be sorted (descending, as all the mutation fragments would) according to their end (but range tombstone would still be stored according to their start bound). As it turns out, mutation::consume, when called with legacy_half_reverse option produces invalid fragment stream, one where all the row tombstone changes come after all the clustering rows. This was not an issue, since when constructing results from the query, Scylla would not pass the tombstones to the client, but instead compact data beforehand. In this commit, the consume_in_reverse::legacy_half_reverse is removed, along with all the uses. As for the swap out in mutation_partition.cc in query_mutation and to_data_query_result: The downstream was not prepared to deal with legacy_half_reverse. mutation::consume contains ``` if (reverse == consume_in_reverse::yes) { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } else { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } ``` So why did it work at all? to_data_query_result deals with a single slice. The used consumer (compact_for_query_v2) compacts-away the range tombstone changes, and thus the only difference between the consume_in_reverse::no and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys and the second one was ordered decreasing. This property is maintained if we swap out for the consume_in_reverse::yes format. Refs: #12353 Closes #12453 * github.com:scylladb/scylladb: mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes mutation: move consume_in_reverse def to mutation_consumer.hh	2023-01-08 15:42:00 +02:00
Botond Dénes	c4688563e3	sstables: track decompressed buffers Convert decompressed temporary buffers into tracked buffers just before returning them to the upper layer. This ensures these buffers are known to the reader concurrency semaphore and it has an accurate view of the actual memory consumption of reads. Fixes: #12448 Closes #12454	2023-01-08 15:34:28 +02:00
Wojciech Mitros	996a942e05	test: assert that WASM allocations can fail without crashing The main source of big allocations in the WASM UDF implementation is the WASM Linear Memory. We do not want Scylla to crash even if a memory allocation for the WASM Memory fails, so we assert that an exception is thrown instead. The wasmtime runtime does not actually fail on an allocation failure (assuming the memory allocator does not abort and returns nullptr instead - which our seastar allocator does). What happens then depends on the failed allocation handling of the code that was compiled to WASM. If the original code threw an exception or aborted, the resulting WASM code will trap. To make sure that we can handle the trap, we need to allow wasmtime to handle SIGILL signals, because that what is used to carry information about WASM traps. The new test uses a special WASM Memory allocator that fails after n allocations, and the allocations include both memory growth instructions in WASM, as well as growing memory manually using the wasmtime API. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2023-01-06 14:07:29 +01:00
Wojciech Mitros	f05d612da8	wasm: limit memory allocated using mmap The wasmtime runtime allocates memory for the executable code of the WASM programs using mmap and not the seastar allocator. As a result, the memory that Scylla actually uses becomes not only the memory preallocated for the seastar allocator but the sum of that and the memory allocated for executable codes by the WASM runtime. To keep limiting the memory used by Scylla, we measure how much memory do the WASM programs use and if they use too much, compiled WASM UDFs (modules) that are currently not in use are evicted to make room. To evict a module it is required to evict all instances of this module (the underlying implementation of modules and instances uses shared pointers to the executable code). For this reason, we add reference counts to modules. Each instance using a module is a reference. When an instance is destroyed, a reference is removed. If all references to a module are removed, the executable code for this module is deallocated. The eviction of a module is actually acheved by eviction of all its references. When we want to free memory for a new module we repeatedly evict instances from the wasm_instance_cache using its LRU strategy until some module loses all its instances. This process may not succeed if the instances currently in use (so not in the cache) use too much memory - in this case the query also fails. Otherwise the new module is added to the tracking system. This strategy may evict some instances unnecessarily, but evicting modules should not happen frequently, and any more efficient solution requires an even bigger intervention into the code.	2023-01-06 14:07:29 +01:00
Wojciech Mitros	b8d28a95bf	wasm: add configuration options for instance cache and udf execution Different users may require different limits for their UDFs. This patch allows them to configure the size of their cache of wasm, the maximum size of indivitual instances stored in the cache, the time after which the instances are evicted, the fuel that all wasm UDFs are allowed to consume before yielding (for the control of latency), the fuel that wasm UDFs are allowed to consume in total (to allow performing longer computations in the UDF without detecting an infinite loop) and the hard limit of the size of UDFs that are executed (to avoid large allocations)	2023-01-06 14:07:27 +01:00
Wojciech Mitros	3214f5c2db	test: check that wasmtime functions yield The new implementation for WASM UDFs allows executing the UDFs in pieces. This commit adds a test asserting that the UDF is in fact divided and that each of the execution segments takes no longer than 1ms.	2023-01-06 14:05:53 +01:00
Michał Radwański	1fbf433966	mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse This commit removes consume_in_reverse::legacy_half_reverse, an option once used to indicate that the given key ranges are sorted descending, based on the clustering key of the start of the range, and that the range tombstones inside partition would be sorted (descending, as all the mutation fragments would) according to their end (but range tombstone would still be stored according to their start bound). As it turns out, mutation::consume, when called with legacy_half_reverse option produces invalid fragment stream, one where all the row tombstone changes come after all the clustering rows. This was not an issue, since when constructing results from the query, Scylla would not pass the tombstones to the client, but instead compact data beforehand. In this commit, the consume_in_reverse::legacy_half_reverse is removed, along with all the uses. As for the swap out in mutation_partition.cc in query_mutation and to_data_query_result: The downstream was not prepared to deal with legacy_half_reverse. mutation::consume contains ``` if (reverse == consume_in_reverse::yes) { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } else { while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) { co_await yield(); } } ``` So why did it work at all? to_data_query_result deals with a single slice. The used consumer (compact_for_query_v2) compacts-away the range tombstone changes, and thus the only difference between the consume_in_reverse::no and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys and the second one was ordered decreasing. This property is maintained if we swap out for the consume_in_reverse::yes format.	2023-01-05 18:48:55 +01:00
Michał Jadwiszczak	83bb77b8bb	test/boost/cql_query_test: enable `parallelized_aggregation` Run tests for parallelized aggregation with `enable_parallelized_aggregation` set always to true, so the tests work even if the default value of the option is false. Closes #12409	2023-01-04 10:11:25 +02:00
Avi Kivity	2739ac66ed	treewide: drop cql_serialization_format Now that we don't accept cql protocol version 1 or 2, we can drop cql_serialization format everywhere, except when in the IDL (since it's part of the inter-node protocol). A few functions had duplicate versions, one with and one without a cql_serialization_format parameter. They are deduplicated. Care is taken that `partition_slice`, which communicates the cql_serialization_format across nodes, still presents a valid cql_serialization_format to other nodes when transmitting itself and rejects protocol 1 and 2 serialization\ format when receiving. The IDL is unchanged. One test checking the 16-bit serialization format is removed.	2023-01-03 19:54:13 +02:00
Raphael S. Carvalho	b4e4bbd64a	database_test: Reduce x_log2_compaction_group values to avoid timeout database_test in timing out because it's having to run the tests calling do_with_cql_env_and_compaction_groups 3x, one for each compaction group setting. reduce it to 2 settings instead of 3 if running in debug mode. Refs #12396. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12421	2023-01-01 13:56:18 +02:00
Raphael S. Carvalho	e7380bea65	test: mutation_test: Test multiple compaction groups Extends mutation_test to run the tests with more than one compaction group, in addition to a single one (default). Piggyback on existing tests. Avoids duplication. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 12:36:07 -03:00
Raphael S. Carvalho	e3e7c3c7e5	test: database_test: Test multiple compaction groups Extends database_test to run the tests with more than one compaction group, in addition to a single one (default). Piggyback on existing tests. Avoids duplication. Caught a bug when snapshotting, in implementation of table::can_flush(), showing its usefulness. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 12:36:07 -03:00

1 2 3 4 5 ...

2125 Commits