Today's alloc() accepts migrate-fn, size and alignment. All the callers
don't really need to provide anything special for the migrate-fn and
are just happy with default alignof() for alignment. The simplification
is in providing alloc() that only accepts size arg and does the rest
itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
for as long as the test executable is running.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
"
The current scrub compaction has a serious drawback: while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in how it repairs such corruptions, as it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to lose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view; it
can only do so on the database level. So in certain cases it might be
desirable to have a less heavy-handed approach to cleansing the data,
one that tries as hard as it can to not lose any data.
This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to lose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost, every scrub that does more than just discard data
that is suspected to be corrupt (and even those, to a certain degree) has
to consider the possibility that it is rehabilitating corruptions,
leaving them in the system without a warning, in the sense that the
user won't see any more problems due to low-level corruptions and
hence might think everything is alright, while data is still corrupt
from the business logic point of view. It is very hard to draw a line
between what scrub should and shouldn't do, yet there is a demand from
users for a scrub that can restore data without losing any of it. Note
that anybody executing such a scrub is already in bad shape: even if
they can read their data (they often can't), it is already corrupt, so
scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
enum, which now selects the scrub mode. This means that
`skip_corrupted` cannot be combined with segregate to throw out what
segregate can't fix. This was chosen for simplicity: a bunch of
flags, all interacting with each other, is very hard to see through in
my opinion, while a linear mode selector is much easier to follow.
* The new segregate mode goes all-in, by trying to fix even
fragment-level disorder. Maybe it should only do it on the partition
level, or maybe this should be made configurable, allowing the user to
select what happens to data that cannot be fixed.
Tests: unit(dev), unit(sstable_datafile_test:debug)
"
* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
test: boost/sstable_datafile_test: add tests for segregate mode scrub
api: storage_service/keyspace_scrub: expose new segregate mode
sstables: compaction/scrub: add segregate mode
mutation_fragment_stream_validator: add reset methods
mutation_writer: add segregate_by_partition
api: /storage_service/keyspace_scrub: add scrub mode param
sstables: compaction/scrub: replace skip_corrupted with mode enum
sstables: compaction/scrub: prevent infinite loop when last partition end is missing
tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.
Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.
This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.
A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.
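The release-before-reassign pattern described above can be illustrated with a minimal, self-contained sketch. The `toy_gate` and `operation` types below are simplified stand-ins invented for illustration (the real code uses a seastar gate held through `lw_shared_ptr`); only the ordering of "leave the old gate, then take over the other one" is taken from this commit message.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Hypothetical stand-in for a gate: it only counts entered operations.
struct toy_gate {
    int count = 0;
    void enter() { ++count; }
    void leave() { assert(count > 0); --count; }
};

// Simplified analogue of phased_barrier::operation (illustration only).
class operation {
    std::shared_ptr<toy_gate> _gate;
public:
    operation() = default;
    explicit operation(std::shared_ptr<toy_gate> g) : _gate(std::move(g)) {
        if (_gate) {
            _gate->enter();
        }
    }
    operation(const operation&) = delete;
    operation& operator=(const operation&) = delete;
    // Moving transfers ownership; the moved-from pointer becomes empty.
    operation(operation&& o) noexcept : _gate(std::move(o._gate)) {}
    operation& operator=(operation&& o) noexcept {
        if (this != &o) {
            // Destroy *this first so the gate held so far is left,
            // then move-construct from the other operation in place.
            this->~operation();
            new (this) operation(std::move(o));
        }
        return *this;
    }
    ~operation() {
        if (_gate) {
            _gate->leave();
        }
    }
};
```

With a defaulted move-assignment, the gate previously held by the assigned-to operation would never see its `leave()` call, which is the leak this change addresses.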
Fixes #8613
Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
"
The current printout has multiple problems:
* It is segregated by state, each having its own sorting criteria.
* Number of permits and count resources is collapsed into a single
column, not clear which is the one printed.
* Number of available/initial units of the semaphore are not printed.
This series solves all these problems:
* It merges all states into a single table, sorted by memory
consumption, in descending order.
* It separates number of permits and count resources into separate
columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum amount of printable lines, to not blow
up the logs.
The goal of all this is to make it easy to find the culprit of a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently added `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO 2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count memory table/description/state
7       2     77M    ks.tbl1/op1/active
6       3     59M    ks.tbl1/op0/active
4       0     36M    ks.tbl1/op2/active
3       1     36M    ks.tbl0/op2/active
11      2     43M    permits omitted for brevity
31      8     251M   total
"
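The report layout described in this series (one merged table, sorted by memory in descending order, capped at a maximum number of lines, followed by an "omitted" row and a total) can be sketched as follows. This is only an illustration: `permit_row`, `report` and `build_report` are hypothetical names, and the real code aggregates permits itself before printing.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row of the diagnostics table.
struct permit_row {
    std::string name;     // table/description/state
    uint64_t permits = 0; // number of permits aggregated in this row
    uint64_t count = 0;   // count resources
    uint64_t memory = 0;  // memory resources
};

struct report {
    std::vector<permit_row> shown; // at most max_lines rows, largest memory first
    permit_row omitted;            // sum of the rows cut off by the cap
    permit_row total;              // sum of all rows
};

inline void accumulate(permit_row& into, const permit_row& r) {
    into.permits += r.permits;
    into.count += r.count;
    into.memory += r.memory;
}

report build_report(std::vector<permit_row> rows, std::size_t max_lines) {
    // Sort by memory consumption, biggest consumer first.
    std::stable_sort(rows.begin(), rows.end(),
                     [](const permit_row& a, const permit_row& b) {
                         return a.memory > b.memory;
                     });
    report rep;
    rep.omitted.name = "permits omitted for brevity";
    rep.total.name = "total";
    for (std::size_t i = 0; i < rows.size(); ++i) {
        accumulate(rep.total, rows[i]);
        if (i < max_lines) {
            rep.shown.push_back(rows[i]);
        } else {
            accumulate(rep.omitted, rows[i]);
        }
    }
    return rep;
}
```

Capping the table while still summing the omitted rows keeps the totals meaningful even when most lines are hidden.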
* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.
However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.
This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.
This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.
Fixes #8600
Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" suffix.
To give an example, the index name "a" should be listed before "a1", but
the view name "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.
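Both the name transformation and the remaining ordering problem can be shown in a few lines. The helper name `index_name_from_view` is hypothetical and is not the function used in the patch; it only illustrates stripping the "_index" suffix.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: map the backing view's name to the index name by
// dropping a trailing "_index". Names without the suffix are returned as-is.
std::string index_name_from_view(std::string_view view_name) {
    constexpr std::string_view suffix = "_index";
    if (view_name.size() >= suffix.size() &&
        view_name.substr(view_name.size() - suffix.size()) == suffix) {
        view_name.remove_suffix(suffix.size());
    }
    return std::string(view_name);
}

// True when the view names sort in the same relative order as the index
// names derived from them (byte-wise comparison).
bool same_order(const std::string& view_a, const std::string& view_b) {
    return (view_a < view_b) ==
           (index_name_from_view(view_a) < index_name_from_view(view_b));
}
```

For the pair from the example, `same_order("a_index", "a1_index")` is false: the views sort as "a1_index", "a_index", while the index names sort as "a", "a1".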
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
Ref: #7617
This series adds timeout parameters to service levels.
Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).
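The precedence rule stated above (per-query timeout, then service level timeout, then the configured default) can be sketched as a small resolution function. The function and parameter names are hypothetical and do not correspond to the actual code in this series.

```cpp
#include <chrono>
#include <optional>

using timeout_duration = std::chrono::milliseconds;

// Hypothetical resolution order, most specific setting first:
//   1. USING TIMEOUT on the query itself,
//   2. the timeout of the service level attached to the role,
//   3. the default from the configuration file.
timeout_duration effective_timeout(std::optional<timeout_duration> per_query,
                                   std::optional<timeout_duration> service_level,
                                   timeout_duration config_default) {
    if (per_query) {
        return *per_query;
    }
    if (service_level) {
        return *service_level;
    }
    return config_default;
}
```

An unset service level timeout (for example after `write_timeout = null`) is modeled here as an empty optional, so the configured default applies again.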
Closes #7913
* github.com:scylladb/scylla:
docs: add a paragraph describing service level timeouts
test: add per-service-level timeout tests
test: add refreshing client state
transport: add updating per-service-level params
client_state: allow updating per service level params
qos: allow returning combined service level options
qos: add a way of merging service level options
cql3: add preserving default values for per-sl timeouts
qos: make getting service level public
qos: make finding service level public
treewide: remove service level controller from query state
treewide: propagate service level to client state
sstables: disambiguate boost::find
cql3: add a timeout column to LIST SERVICE LEVEL statement
db: add extracting service level info via CQL
types: add a missing translation for cql_duration
cql3: allow unsetting service level timeouts
cql3: add validating service level timeout values
db: add setting service level params via system_distributed
cql3: add fetching service level attrs in ALTER and CREATE
cql3: add timeout to service level params
qos: add timeout to service level info
db,sys_dist_ks: add timeout to the service level table
migration_manager: allow table updates with timestamp
cql3: allow a null keyword for CQL properties
"
Storage service needs migration notifier reference to pass it to cdc
service via get_local_storage_service(). This set removes
- get_local_storage_service from cdc
- migration notifier from storage service
- db_context::builder from cdc (released nuclear binding energy)
tests: unit(dev)
"
* 'br-cdc-no-storage-service' of https://github.com/xemul/scylla:
storage_service: Remove migration notifier dependency
cdc: Remove db_context::builder
cdc: Provide migration notifier right at once
cdc: Remove db_context::builder::with_migration_notifier
Not really testing anything, at least not automatically. It just
provides coverage for the diagnostics dump code, as well as allows for
developers to inspect the printout visually when making changes.
In order to avoid needless schema disagreements, a way of announcing
a schema change with fixed timestamp is added.
That way, when nodes update schemas of their internal tables (e.g.
during updates), it's possible for all nodes to use an identical
timestamp for this operation, which in turn makes their digests
identical.
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.
This is fixed by respecting the offstrategy threshold. Unit test is
added.
Fixes #8573.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.
This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.
Test results (perf_simple_query --smp 1 --task-quota-ms 10):
```
before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op)
after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op)
```
4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).
The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.
I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.
Closes #8592
* github.com:scylladb/scylla:
storage_proxy, treewide: use utils::small_vector inet_address_vector:s
storage_proxy, treewide: introduce names for vectors of inet_address
utils: small_vector: add print operator for std::ostream
hints: messages.hh: add missing #include
The mutation_reader_test is already one of our largest test files.
Move the reader concurrency semaphore related tests to a new file,
making them easier to find and making the mutation reader test a little bit
smaller too.
These two tests (restricted_reader_timeout and
restricted_reader_max_queue_length) are testing the semaphore in
reality, but through the restricted reader, which is distracting as it
needlessly brings in an additional layer into the picture. Rewrite them
to test the semaphore directly, getting much lighter in the process.
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
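A minimal sketch of the idea, assuming alias names along the lines of `inet_address_vector_replica_set` and `inet_address_vector_topology_change` (the exact names are an assumption here) and a toy `inet_address` type standing in for `gms::inet_address`:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy stand-in for gms::inet_address (illustration only).
struct inet_address {
    uint32_t raw = 0;
    bool operator==(const inet_address& o) const { return raw == o.raw; }
};

// Assumed alias names, one per role. Both are still plain std::vector here;
// a later change only has to touch these two lines to switch containers.
using inet_address_vector_replica_set = std::vector<inet_address>;
using inet_address_vector_topology_change = std::vector<inet_address>;

// Callers spell out the role of the vector rather than its container type.
bool is_replica(const inet_address_vector_replica_set& replicas,
                const inet_address& node) {
    return std::find(replicas.begin(), replicas.end(), node) != replicas.end();
}
```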
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
Add a new segregator which segregates a stream, potentially containing
duplicate or even out-of-order partitions, into multiple output streams,
such that each output stream has strictly monotonic partitions.
This segregator will be used by a new scrub compaction mode which is
meant to fix sstables containing duplicate or out-of-order data.
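One way to obtain strictly monotonic outputs is a first-fit scheme: append each partition to the first output whose last partition is smaller, and open a new output otherwise. The sketch below assumes this scheme and uses plain integers as stand-ins for partition keys; it is not necessarily how `segregate_by_partition` picks its outputs.

```cpp
#include <vector>

using partition_key = int; // stand-in for a real partition key

// First-fit segregation (an assumed strategy, for illustration):
// every returned output is strictly increasing.
std::vector<std::vector<partition_key>>
segregate(const std::vector<partition_key>& input) {
    std::vector<std::vector<partition_key>> outputs;
    for (partition_key pk : input) {
        bool placed = false;
        for (auto& out : outputs) {
            // Outputs are never empty: each is created with one element.
            if (out.back() < pk) {
                out.push_back(pk);
                placed = true;
                break;
            }
        }
        if (!placed) {
            // Duplicate or out-of-order partition: start a new output.
            outputs.push_back(std::vector<partition_key>{pk});
        }
    }
    return outputs;
}
```

For the input 1, 3, 2, 2, 5, 4 this yields three outputs: {1, 3, 5}, {2, 4} and {2}.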
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:
```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```
For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.
Therefore, to properly decode pre-images you would need to know in which mode pre-image was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check the current pre-image mode).
A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:
If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```
A client library now can properly decode a pre-image row. If it sees a `NULL` value, it can now check the `cdc$deleted_` column to determine if this `NULL` value was a part of pre-image or it was omitted due to not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.
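On the client side, the decoding rule described above could look roughly like the following sketch. The enum and function names are hypothetical, and the column values are modeled as `std::optional` (empty meaning `NULL`).

```cpp
#include <optional>

// Possible interpretations of one column in a pre-image row.
enum class preimage_column_state {
    has_value,    // the column had a non-NULL value in the pre-image
    was_null,     // the column was NULL in the pre-image
    not_included, // the column was not part of the pre-image at all
};

// value:   the column as read from the pre-image row
// deleted: the matching cdc$deleted_<column> value from the same row
template <typename T>
preimage_column_state classify(const std::optional<T>& value,
                               const std::optional<bool>& deleted) {
    if (value) {
        return preimage_column_state::has_value;
    }
    if (deleted && *deleted) {
        return preimage_column_state::was_null;
    }
    return preimage_column_state::not_included;
}
```

With this rule, the two rows from the example (`cdc$deleted_v=NULL` versus `cdc$deleted_v=true`) are no longer ambiguous, regardless of which pre-image mode the table was in.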
Additional example of the trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:
```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```
```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.
Closes #8568
* github.com:scylladb/scylla:
cdc: log: assert post_image is always in full mode
cdc: tests: check cdc$deleted_ columns in images
cdc: log: fill cdc$deleted_ columns in pre-images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.
This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
Introduce a tagged id struct for `group_id`.
Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).
Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.
The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.
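The tagged id idea can be sketched as a small template whose tag parameter makes ids of different kinds distinct types. This is a simplified illustration: the 16 random bytes below are not a standards-compliant UUID, and `tagged_id`, `group_id_tag` and `server_id_tag` are names assumed for the example.

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <type_traits>

// A 128-bit id whose type depends on Tag, so that a group id cannot be
// passed where a server id is expected.
template <typename Tag>
struct tagged_id {
    std::array<uint8_t, 16> bytes{};

    bool operator==(const tagged_id& o) const { return bytes == o.bytes; }
    bool operator!=(const tagged_id& o) const { return !(*this == o); }

    // Fill the id with random bytes (no UUID version bits are set here).
    static tagged_id create_random() {
        static thread_local std::mt19937_64 rng{std::random_device{}()};
        tagged_id id;
        for (auto& b : id.bytes) {
            b = static_cast<uint8_t>(rng() & 0xff);
        }
        return id;
    }
};

struct group_id_tag {};
struct server_id_tag {};
using group_id = tagged_id<group_id_tag>;
using server_id = tagged_id<server_id_tag>;

static_assert(!std::is_same<group_id, server_id>::value,
              "ids with different tags must be different types");
```

A randomly generated `group_id` for "raft group 0" makes an accidental match between two unrelated clusters extremely unlikely, which a fixed literal `0` cannot offer.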
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
The only reason why storage service keeps a reference on the migration
notifier is that the latter was needed by cdc before previous patch.
Now cdc gets the notifier directly from main, so storage service is
a bit more off the hook.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 unit in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test, we now have a dedicated
unit test for checking a heavily contested semaphore, that does it
properly, so no need to try to fix this clumsy attempt that is just
making trouble at this point.
Refs: #8493
Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kinds of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.
The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
readmitting a previously admitted reader doesn't leak any units.
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
block its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefit of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resources by the permits. Furthermore, the status of a permit's
resources is much simpler to reason about: there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
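The admission criterion this revert relies on can be written down as a small predicate. The sketch below is an assumption-laden simplification: `resources` and `can_admit` are hypothetical names, and available memory is allowed to go negative because memory-only permits consume memory without passing admission.

```cpp
#include <cstdint>

struct resources {
    int64_t count = 0;
    int64_t memory = 0; // may be negative when memory is over-consumed
};

// Admit a read when there are enough resources for it, or when all the
// initial count resources are available (no admitted reader is alive),
// regardless of how much memory is left.
bool can_admit(const resources& available,
               const resources& initial,
               const resources& needed) {
    const bool enough = available.count >= needed.count &&
                        available.memory >= needed.memory;
    const bool no_admitted_reader = available.count == initial.count;
    return enough || no_admitted_reader;
}
```

Because at least one read is always admitted when no admitted reader is alive, memory-only permits can delay admission but cannot block it forever.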
fa43d7680 recently introduced mandatory closing of readers before they
are destroyed. One reader destroy path that was left not closing the
reader before destruction is `inactive_reader_handle::abandon()`. This
path is executed when the handle is destroyed while still referring to a
non-evicted inactive read. This patch fixes it up to close the reader
and adds a small unit test which checks that this happens.
"
This patchset adds future-returning close methods to all
flat_mutation_reader-s and makes sure that all readers
are explicitly closed and waited for.
The main motivation for doing so is for providing a path
for cancelling outstanding i/o requests via the input_stream
close (See https://github.com/scylladb/seastar/issues/859)
and wait until they complete.
This series also introduces a stop
method to reader_concurrency_semaphore to be used when
shutting down the database, instead of calling
clear_inactive_readers in the database destructor.
The series does not change microbenchmarks performance in a significant way.
It looks like the results are within the tests' jitter.
- perf_simple_query: (in transactions per second, more is better)
before: median 184701.83 tps (90 allocs/op, 20 tasks/op)
after: median 188970.69 tps (90 allocs/op, 20 tasks/op) (+2.3%)
- perf_mutation_readers: (in time per iteration, less is better)
combined.one_row 65.042ns -> 57.961ns (-10.9%)
combined.single_active 46.634us -> 46.216us ( -0.9%)
combined.many_overlapping 364.752us -> 371.507us ( +1.9%)
combined.disjoint_interleaved 43.634us -> 43.448us ( -0.4%)
combined.disjoint_ranges 43.011us -> 42.991us ( -0.0%)
combined.overlapping_partitions_disjoint_rows 57.609us -> 58.820us ( +2.1%)
clustering_combined.ranges_generic 93.464ns -> 96.236ns ( +3.0%)
clustering_combined.ranges_specialized 86.537ns -> 87.645ns ( +1.3%)
memtable.one_partition_one_row 903.546ns -> 957.639ns ( +6.0%)
memtable.one_partition_many_rows 6.474us -> 6.444us ( -0.5%)
memtable.one_large_partition 905.593us -> 878.271us ( -3.0%)
memtable.many_partitions_one_row 13.815us -> 14.718us ( +6.5%)
memtable.many_partitions_many_rows 161.250us -> 158.590us ( -1.6%)
memtable.many_large_partitions 24.237ms -> 23.348ms ( -3.7%)
average -0.02%
Fixes #1076
Refs #2927
Test: unit(release, debug)
Perf: perf_mutation_readers, perf_simple_query (release)
Dtest: next-gating(release),
materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_max_to_half_test repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test(debug)
"
* tag 'flat_mutation_reader-close-v7' of github.com:bhalevy/scylla: (94 commits)
mutation_reader: shard_reader: get rid of stop
mutation_reader: multishard_combining_reader: get rid of destructor
flat_mutation_reader: abort if not closed before destroyed
flat_mutation_reader: require close
repair: row_level_repair: run: close repair_meta when done
repair: repair_reader: close underlying reader on_end_of_stream
perf: everywhere: close flat_mutation_reader when done
test: everywhere: close flat_mutation_reader when done
mutation_partition: counter_write_query: close reader when done
index: built_indexes_reader: implement close
mutation_writer: multishard_writer: close readers when done
mutation_writer: feed_writer: close reader when done
table: for_all_partitions_slow: close iteration_step reader when done
view_builder: stop: close all build_step readers
stream_transfer_task: execute: close send_info reader when done
view_update_generator: start: close staging_sstable_reader when done
view: build_progress_virtual_reader: implement close method
view: generate_view_updates: close builder readers when done
view_builder: initialize_reader_at_current_token: close reader before reassigning it
view_builder: do_build_step: close build_step reader when done
...
Make flat_mutation_reader::impl::close pure virtual
so that all implementations are required to implement it.
With that, provide a trivial implementation to
all implementations that currently use the default,
trivial close implementation.
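The shape of this change can be sketched without seastar as follows. Everything here is simplified for illustration: the real `close()` returns a future, and `reader_impl`, `empty_reader` and `reader` are hypothetical names rather than the actual classes.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Base class of all reader implementations. close() is pure virtual, so
// every implementation has to state explicitly how it is closed.
class reader_impl {
public:
    virtual ~reader_impl() = default;
    virtual void close() noexcept = 0; // synchronous here, for brevity
};

// An implementation with nothing to release still spells out a trivial close.
class empty_reader final : public reader_impl {
public:
    void close() noexcept override {}
};

// Handle that requires close() to be called before destruction.
class reader {
    std::unique_ptr<reader_impl> _impl;
public:
    explicit reader(std::unique_ptr<reader_impl> impl) : _impl(std::move(impl)) {}
    reader(reader&&) noexcept = default;
    void close() noexcept {
        if (_impl) {
            _impl->close();
            _impl.reset();
        }
    }
    bool is_closed() const noexcept { return !_impl; }
    ~reader() {
        // A moved-from handle has no implementation and needs no close.
        assert(!_impl && "reader destroyed without close()");
    }
};

static_assert(std::is_abstract<reader_impl>::value,
              "implementations must provide close()");
```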
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Close the _closing_gate to wait on background
close of dropped queries, and close all remaining queriers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make sure to close the querier and subsequently its reader before
destroying it (unless it was moved).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In addition to clear_inactive_reads, that's currently called when
the database object is destroyed, introduce a stop() method that will:
1. wait on all background closes of inactive_reads.
2. close all present inactive_reads and wait on their close.
3. signal waiters on the wait_list via broken() with a proper
exception indicating that the semaphore was closed.
In addition, assert in the semaphore's destructor
that it has no remaining inactive reads.
Stop must be called from whoever owns the r_c_s.
Mainly, from database::stop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the logic in ~foreign_reader to close()
to wait on the read_ahead future and close the underlying
reader on the remote shard. Still call close in the background
in ~foreign_reader if destroyed without closing to keep the current
behavior, but warn about it, until it's proved to be unneeded.
Also, added on_internal_error in close if _read_ahead_future
is engaged but _reader is not, since this must never happen
and we wait on the _read_ahead_future without the _reader.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Close _delegate if it's engaged both in the close() method
and whenever it is currently reset by _delegate = {}.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>