scylladb

Author	SHA1	Message	Date
Botond Dénes	5e39cedbe3	evictable_reader: remove _reader_created flag This flag is not really needed, because we can just attempt a resume on first use which will fail with the default constructed inactive read handle and the reader will be created via the recreate-after-evicted path. This allows the same path to be used for all reader creation cases, simplifying the logic and more importantly making further patching easier without the special case. To make the recreate path (almost) as cheap for the first reader creation as it was with the special path, `_trim_range_tombstones` and `_validate_partition_key` are only set when really needed. Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>	2021-05-16 14:45:46 +03:00
Botond Dénes	3b57106627	evictable_reader: remove destructor We now have close() which is expected to clean up, no need for cleanup in the destructor and consequently a destructor at all. Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>	2021-05-16 12:19:41 +03:00
Benny Halevy	6e62ec8c24	mutation_reader: shard_reader: get rid of stop Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	fc5e4688db	mutation_reader: multishard_combining_reader: get rid of destructor Now that the multishard_combining_reader is guaranteed to be called there is no longer need for stopping the shard readers in the destructor. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	5b22731f9a	flat_mutation_reader: require close Make flat_mutation_reader::impl::close pure virtual so that all implementations are required to implement it. With that, provide a trivial implementation to all implementations that currently use the default, trivial close implementation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	51c96d405d	mutation_reader: evictable_reader: fill_buffer: make sure to close the reader If reader.fill_buffer() fails, we will not call `maybe_pause` and the reader will be destroyed, so make sure to close it. Otherwise, the reader is std::move'ed to `maybe_pause`, which either pauses it using register_inactive_read or std::move's it further to _reader; in both cases it doesn't need to be closed. `with_closeable` can safely try to close the moved-from reader and do nothing in this case, as the f_m_r::impl was already moved away. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	a3f9dc6e0b	mutation_reader: multishard_combining_reader: implement close Close all underlying shard readers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	58b1da8cf5	mutation_reader: shard_reader: implement close return reader lifecycle policy's destroy_reader future so it can be waited on by caller (multishard_combining_reader::close). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	2c1edb1a94	mutation_reader: reader_lifecycle_policy: return future from destroy_reader So we can wait on it from to-be-introduced shard_reader::close(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	bfe56fd99c	mutation_reader: shard_reader: get rid of _stopped It's unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	e1ec401bb6	mutation_reader: evictable_reader: implement close If there's an active reader then close it, else, try to resume the paused reader, and close it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	84206501ae	mutation_reader: foreign_reader: wait for readahead and close underlying reader Move the logic in ~foreign_reader to close() to wait on the read_ahead future and close the underlying reader on the remote shard. Still call close in the background in ~foreign_reader if destroyed without closing to keep the current behavior, but warn about it, until it's proved to be unneeded. Also, added on_internal_error in close if _read_ahead_future is engaged but _reader is not, since this must never happen and we wait on the _read_ahead_future without the _reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	ea3f2a6536	mutation_reader: restricting_mutation_reader: close underlying reader If a reader was admitted, close it in close(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	f9ae50483f	mutation_reader: merging_reader: close underlying merger Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	dccdbdff95	mutation_reader: mutation_fragment_merger: close underlying producer This will be needed by the merging_reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	761a38ce21	mutation_reader: mutation_reader_merger: make sure to close underlying readers These will be called by merging_reader::close via mutation_fragment_merger::close in the following patches. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	b140ea6df2	mutation_reader: compacting_reader: implement close Close underlying reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Kamil Braun	7ffb0d826b	clustering_order_reader_merger: handle empty readers The merger could return end-of-stream if some (but not all) of the underlying readers were empty (i.e. not even returning a `partition_start`). This could happen in places where it was used (`time_series_sstable_set::create_single_key_sstable_reader`) if we opened an sstable which did not have the queried partition but passed all the filters (specifically, the bloom filter returned a false positive for this sstable). The commit also extends the random tests for the merger to include empty readers and adds an explicit test case that catches this bug (in a limited scope: when we merge a single empty reader). It also modifies `test_twcs_single_key_reader_filtering` (regression test for #8432) because the time where the clustering key filter is invoked changes (some invocations move from the constructor of the merger to operator()). I checked manually that it still catches the bug when I reintroduce it. Fixes #8445. Closes #8446	2021-04-12 10:34:52 +03:00
Botond Dénes	bc1fcd3db2	multishard_combining_reader: only read from needed shards The multishard combining reader currently assumes that all shards have data for the read range. This however is not always true and in extreme cases (like reading a single token) it can lead to huge read amplification. Avoid this by not pushing shards to `_shard_selection_min_heap` if the first token they are expected to produce falls outside of the read range. Also change the read ahead algorithm to select the shards from `_shard_selection_min_heap`, instead of walking them in shard order. This was wrong in two ways: * Shards may be ordered differently with respect to the first partition they will produce; reading ahead on the next shard in shard order might not bring in data on the next shard the read will continue on. Shard order is only correct when starting a new range and shards are iterated over in the order they own tokens according to the sharding algorithm. * Shards that may not have data relevant to the read range are also considered for read ahead. After this patch, the multishard reader will only read from shards that have data relevant to the read range, both in the case of normal reads and also for read-ahead. Fixes: #8161 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>	2021-02-26 23:29:20 +02:00
Botond Dénes	c3b4c3f451	evictable_reader: reset _range_override after fast-forwarding `_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure. Fixes: #8059 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>	2021-02-17 19:11:00 +02:00
Benny Halevy	d565e3fb57	reader_lifecycle_policy: retire low level try_resume method The caller can now just call sem.unregister_inactive_read(irh) directly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Benny Halevy	4e8f29ef14	reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader There's no need to hold a unique_ptr<flat_mutation_reader> as flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl> and functions as a unique ptr via flat_mutation_reader_opt. With that, unregister_inactive_read was modified to return a flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>, keeping exactly the same semantics. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-02-08 20:32:40 +02:00
Avi Kivity	913d970c64	Merge "Unify inactive readers" from Botond " Currently inactive readers are stored in two different places: * reader concurrency semaphore * querier cache With the latter registering its inactive readers with the former. This is an unnecessarily complex (and possibly surprising) setup that we want to move away from. This series solves this by moving the responsibility of storing inactive reads solely to the reader concurrency semaphore, including all supported eviction policies. The querier cache is now only responsible for indexing queriers and maintaining relevant stats. This makes the ownership of the inactive readers much more clear, hopefully making Benny's work on introducing close() and abort() a little bit easier. Tests: unit(release, debug:v1) " * 'unify-inactive-readers/v2' of https://github.com/denesb/scylla: reader_concurrency_semaphore: store inactive readers directly querier_cache: store readers in the reader concurrency semaphore directly querier_cache: retire memory based cache eviction querier_cache: delegate expiry to the reader_concurrency_semaphore reader_concurrency_semaphore: introduce ttl for inactive reads querier_cache: use new eviction notify mechanism to maintain stats reader_concurrency_semaphore: add eviction notification facility reader_concurrency_semaphore: extract evict code into method evict()	2021-02-03 10:59:04 +02:00
Botond Dénes	226088d12e	mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag It's not used anymore.	2021-01-22 16:18:59 +02:00
Botond Dénes	4eb65b12a0	mutation_reader: evictable_reader: remove next_partition() workaround `next_partition()` now returns a future<>, so we can forward it to the remote shard in the scope of the next partition call, remove the now obsolete workaround for the synchronous next partition.	2021-01-22 16:18:30 +02:00
Botond Dénes	febd2feb4c	mutation_reader: shard_reader: remove next_partition() workaround `next_partition()` now returns a future<>, so we can forward it to the remote shard in the scope of the next partition call, remove the now obsolete workaround for the synchronous next partition.	2021-01-22 15:53:05 +02:00
Botond Dénes	81da6b756f	mutation_reader: foreign_reader: remove next_partition() workaround `next_partition()` now returns a future<>, so we can forward it to the remote shard in the scope of the next partition call, remove the now obsolete workaround for the synchronous next partition.	2021-01-22 15:30:36 +02:00
Kamil Braun	570d15c7bc	multishard_combining_reader: do not use `smp::count` `multishard_combining_reader` currently only works under the assumption that every table uses the same sharder configured using the node's number of shards. But we could potentially specify a different sharder for a chosen table, e.g. one that puts everything on shard 0. Then this assumption will be broken and the reader causes a segfault. Fixes #7945.	2021-01-21 18:28:18 +02:00
Benny Halevy	29002e3b48	flat_mutation_reader: return future from next_partition To allow it to asynchronously close underlying readers on next_partition(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-13 17:35:07 +02:00
Kamil Braun	5e846b33b8	clustering_order_reader_merger: fix the 0 readers case With 0 readers the merger would produce a `partition_end` fragment when it should immediately return `end_of_stream` instead.	2020-12-18 12:30:40 +01:00
Kamil Braun	0b36c5e116	mutation_reader: introduce clustering_order_reader_merger This abstraction is used to merge the output of multiple readers, each opened for a single partition query, into a non-decreasing stream of mutation_fragments. It is similar to `mutation_reader_merger`, an important difference is that the new merger may select new readers in the middle of a partition after it already returned some fragments from that partition. It uses the new `position_reader_queue` abstraction to select new readers. It doesn't support multi-partition (ring range) queries. The new merger will be later used when reading from sstable sets created by TimeWindowCompactionStrategy. This strategy creates many sstables that are mostly disjoint w.r.t the contained clustering keys, so we can delay opening sstable readers when querying a partition until after we have processed all mutation fragments with positions before the keys contained by these sstables.	2020-11-30 11:55:44 +01:00
Kamil Braun	857911d353	mutation_reader: generalize `combined_mutation_reader` It is now called `merging_reader`, and is used to change a `FragmentProducer` that produces a non-decreasing stream of mutation fragment batches into a `flat_mutation_reader` producing a non-decreasing stream of fragments. The resulting stream of fragments is increasing except for places where we encounter range tombstones (multiple range tombstones may be produced with the same position_in_partition). `merging_reader` is a simple adapter over `mutation_fragment_merger`. The old `combined_mutation_reader` is simply a specialization of `merging_reader` where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that merges the output of multiple readers into one non-decreasing stream of fragment batches. There is no separate class for `combined_mutation_reader` now. Instead, `make_combined_reader` works directly with `merging_reader`.	2020-11-19 14:35:11 +01:00
Kamil Braun	60adee6900	mutation_reader: fix description of mutation_fragment_merger The resulting sequence is not necessarily strictly increasing (e.g. if there are range tombstones).	2020-11-19 14:29:04 +01:00
Botond Dénes	f5323b29d9	mutation_reader: queue_reader: don't set EOS flag on abort If the consumer happens to check the EOS flag before it hits the exception injected by the abort (by calling fill_buffer()), they can think the stream ended normally and expect it to be valid. However this is not guaranteed when the reader is aborted. To avoid consumers falsely thinking the stream ended normally, don't set the EOS flag on abort at all. Additionally make sure the producer is aborted too on abort. In theory this is not needed as they are the one initiating the abort, but better to be safe than sorry. Fixes: #7411 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>	2020-11-11 13:44:25 +02:00
Pavel Emelyanov	3da3d448c8	range_tombstone: Remove unused schema arg from .set_start Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-11-06 15:13:05 +03:00
Botond Dénes	ff623e70b3	reader_concurrency_semaphore: name permits Require a schema and an operation name to be given to each permit when created. The schema is that of the table the read is executed against, and the operation name is some name identifying the operation the permit is part of. Ideally this should be different for each site the permit is created at, to be able to discern not only different kinds of reads, but different code paths the read took. As not all reads can be associated with one schema, the schema is allowed to be null. The name will be used for debugging purposes, both for coredump debugging and runtime logging of permit-related diagnostics.	2020-10-13 12:32:13 +03:00
Botond Dénes	307cdf1e0d	multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader() Allow the evictable reader managing the underlying reader to pass its own permit to it when creating it, making sure they share the same permit. Note that the two parts can still end up using different permits, when the underlying reader is kept alive between two pages of a paged read and thus keeps using the permit received on the previous page. Also adjust the `reader_context` in multishard_mutation_query.cc to use the passed-in permit instead of creating a new one when creating a new reader.	2020-10-12 15:56:56 +03:00
Botond Dénes	e09ab09fff	multishard_combining_reader: add permit parameter Don't create an own permit, take one as a parameter, like all other readers do, so the permit can be provided by the higher layer, making sure all parts of the logical read use the same permit.	2020-10-12 15:56:56 +03:00
Botond Dénes	600f1c7853	multishard_combining_reader: shard_reader: use multishard reader's permit Don't create a new permit per shard reader, pass down the multishard reader's one to be used by each shard reader. They all belong to the same read, they should use the same permit. Note that despite its name the shard readers are the local representation of a reader living on a remote shard and as such they live on the same shard the multishard combining reader lives on.	2020-10-12 15:56:56 +03:00
Botond Dénes	dd372c8457	flat_mutation_reader: de-virtualize buffer_size() The main user of this method, the one which required this method to return the collective buffer size of the entire reader tree, is now gone. The remaining two users just use it to check the size of the reader instance they are working with. So de-virtualize this method and reduce its responsibility to just returning the buffer size of the current reader instance.	2020-10-06 08:22:56 +03:00
Botond Dénes	6ca0464af5	mutation_fragment: add schema and permit We want to start tracking the memory consumption of mutation fragments. For this we need schema and permit during construction, and on each modification, so the memory consumption can be recalculated and passed to the permit. In this patch we just add the new parameters and go through the insane churn of updating all call sites. They will be used in the next patch.	2020-09-28 11:27:23 +03:00
Botond Dénes	72a88e0257	mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/ We will soon want to update the memory consumption of mutation fragment after each modification done to it, to do that safely we have to forbid direct access to the underlying data and instead have callers pass a lambda doing their modifications. Uses where this method was just used to move the fragment away are converted to use `as_range_tombstone() &&`.	2020-09-28 10:53:56 +03:00
Botond Dénes	0518571e56	flat_mutation_reader: make _buffer a tracked buffer Via a tracked_allocator. Although the memory allocations made by the _buffer shouldn't dominate the memory consumption of the read itself, they can still be a significant portion that scales with the number of readers in the read.	2020-09-28 10:53:56 +03:00
Botond Dénes	77ea44cb73	mutation_reader: extract the two fill_buffer_result into a single one Currently we have two, nearly identical definitions of said struct. Extract it to a common definition and rename it to `remote_fill_buffer_result`.	2020-09-28 10:53:56 +03:00
Botond Dénes	3fab83b3a1	flat_mutation_reader: impl: add reader_permit parameter Not used yet, this patch does all the churn of propagating a permit to each impl. In the next patch we will use it to track the memory consumption of `_buffer`.	2020-09-28 10:53:48 +03:00
Botond Dénes	0b0ae18a14	evictable_reader: validate buffer after recreating the underlying reader The reader recreation mechanism is a very delicate and error-prone one, as proven by the countless bugs it had. Most of these bugs were related to the recreated reader not continuing the read from the expected position, inserting out-of-order fragments into the stream. This patch adds a defense mechanism against such bugs by validating the start position of the recreated reader. Several things are checked: * The partition is the expected one -- the one we were in the middle of or the next if we stopped at partition boundaries. * The partition is in the read range. * The first fragment in the partition is the expected one -- has an equal or larger position than the next expected fragment. * The fragment is in the clustering range as defined by the slice. As these validations are only done on the slow-path of recreating an evicted reader, no performance impact is expected.	2020-09-25 12:09:00 +03:00
Botond Dénes	91020eef73	evictable_reader: update_next_position(): only use peek'd position on partition boundary `evictable_reader::update_next_position()` is used to record the position the reader will continue from, in the next buffer fill. This position is used to create the partition slice when the underlying reader is evicted and has to be recreated. There is an optimization in this method -- if the underlying's buffer is not empty we peek at the first fragment in it and use it as the next position. This is however problematic for buffer validation on reader recreation (introduced in the next patch), because using the next row's position as the next pos will allow for range tombstones to be emitted with before_key(next_pos.key()), which will trigger the validation. Instead of working around this, just drop this optimization for mid-partition positions, it is inconsequential anyway. We keep it for where it is important, when we detect that we are at a partition boundary. In this case we can avoid reading the current partition altogether when recreating the reader.	2020-09-25 12:09:00 +03:00
Botond Dénes	4f2e7a18e2	evictable_reader: trim range tombstones to the read clustering range Currently mutation sources are allowed to emit range tombstones that are out-of the clustering read range if they are relevant to it. For example a read of a clustering range [ck100, +inf), might start with: range_tombstone{start={ck1, -1}, end={ck200, 1}}, clustering_row{ck100} The range tombstone is relevant to the range and the first row of the range so it is emitted as first, but its position (start) is outside the read range. This is normally fine, but it poses a problem for evictable reader. When the underlying reader is evicted and has to be recreated from a certain clustering position, this results in out-of-order mutation fragments being inserted into the middle of the stream. This is not fine anymore as the monotonicity guarantee of the stream is violated. The real solution would be to require all mutation sources to trim range tombstones to their read range, but this is a lot of work. Until that is done, as a workaround we do this trimming in the evictable reader itself.	2020-09-25 12:09:00 +03:00
Botond Dénes	4944e050e3	mutation_reader: make_combined_reader(): return empty reader when combining 0 readers Avoid creating all the combining machinery when we know there is no data to be had. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200821045602.13096-1-bdenes@scylladb.com>	2020-08-22 20:47:49 +03:00
Botond Dénes	a9013030cf	multishard_mutation_reader: add a trace message for each shard reader created So we can see in the trace output, the shards that actually participated in the reads. There is a single message for each shard reader. Fixes: #6888 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200803132338.95013-1-bdenes@scylladb.com>	2020-08-03 16:24:46 +03:00


237 Commits