"
This series hardens raft_group_registry::stop_servers
and uses it to drain_on_shutdown, called before
the database is stopped in cql_test_env.
(Not needed for main).
raft_group_registry deferred_stop is introduced right after
the service is started to make sure it's properly stopped
even if there's an exception at any point while starting.
Test: unit(dev)
"
* tag 'raft_group_registry-drain_on_shutdown-v1' of https://github.com/bhalevy/scylla:
cql_test_env: raft_group_registry::drain_on_shutdown before stopping the database
raft_group_registry: harden stop_servers
raft_group_registry: delete unused _shutdown_gate
When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader.
Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly.
This patch series replaces hot path uses of empty_flat_reader with an empty optional.
Performance effects:
`perf_simple_query --smp 1`
TPS: 138k -> 168k
allocs/op: 80.2 -> 71.1
insns/op: 49.9k -> 45.1k
`perf_simple_query --smp 1 --enable-cache=1 --flush`
TPS: 125k -> 150k
allocs/op: 79.2 -> 71.1
insns/op: 51.7k -> 47.2k
For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread.
Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache and memtables and needs to be merged) are unaffected.
Closes #10204
* github.com:scylladb/scylla:
replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader()
row_cache: Add row_cache::make_reader_opt()
replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader()
memtable: Add memtable::make_flat_reader_opt()
[avi: adjust #include for readers/ split]
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting them improves
iterative compilation times, as touching rarely used readers no longer
recompiles large chunks of the codebase. Total compilation times are
also improved, as the sizes of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many files in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes #10194
When there is nothing to read, make_flat_reader() returns an empty (no-op)
reader. But it turns out that constructing, combining and destroying that
empty reader is quite costly.
As an optimization, add an alternative version which returns an empty optional
instead.
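A minimal sketch of the idea in standard C++. The types and functions here (`reader`, `make_reader_opt`, `collect_readers`) are hypothetical stand-ins, not Scylla's actual classes: the source returns an empty optional when there is nothing to read, so the caller never constructs, merges or destroys a no-op reader.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mutation reader; not Scylla's flat_mutation_reader.
struct reader {
    std::vector<std::string> rows;
};

// Returns std::nullopt instead of an empty reader when the source has no data.
std::optional<reader> make_reader_opt(std::vector<std::string> rows) {
    if (rows.empty()) {
        return std::nullopt;
    }
    return reader{std::move(rows)};
}

// Keeps only the readers that actually exist, so empty sources add nothing
// to the set that later has to be merged.
std::vector<reader> collect_readers(std::vector<std::optional<reader>> candidates) {
    std::vector<reader> out;
    for (auto& r : candidates) {
        if (r) {
            out.push_back(std::move(*r));
        }
    }
    return out;
}
```

With one empty and one non-empty source, `collect_readers` yields a single reader, which in the real code would allow skipping the merging step entirely.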
database
We're currently stopping raft_gr before
shutting the database down, but we fail to do that if
anything goes wrong before that, e.g. if
distributed_loader::init_non_system_keyspaces fails.
This change splits drain_on_shutdown out of stop()
to stop the raft groups before the database is stopped
and does the rest in a deferred_stop placed right
after the raft_gr registry is started.
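The ordering described above can be sketched with a plain scope guard. This is an illustration in standard C++ only: `service`, `deferred_stop_guard` and `run` are hypothetical names, and the real code uses Seastar futures rather than synchronous calls.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical service that records its lifecycle calls in a log.
struct service {
    std::vector<std::string>& log;
    void start() { log.push_back("start"); }
    void drain_on_shutdown() { log.push_back("drain"); }
    void stop() { log.push_back("stop"); }
};

// Minimal scope guard standing in for a deferred stop: calls stop() on scope exit.
struct deferred_stop_guard {
    service& s;
    explicit deferred_stop_guard(service& svc) : s(svc) {}
    ~deferred_stop_guard() { s.stop(); }
    deferred_stop_guard(const deferred_stop_guard&) = delete;
    deferred_stop_guard& operator=(const deferred_stop_guard&) = delete;
};

// Starts the service and arms the guard immediately. If a later startup step
// throws, the guard still stops the service during stack unwinding.
void run(std::vector<std::string>& log, bool fail_during_startup) {
    service raft_gr{log};
    raft_gr.start();
    deferred_stop_guard stop_raft_gr(raft_gr);
    if (fail_during_startup) {
        throw std::runtime_error("init failed");
    }
    raft_gr.drain_on_shutdown();  // normal path: drain before the database stops
}
```

On the normal path the log reads start, drain, stop; when startup fails it reads start, stop, so the service is never left running.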
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
stop_servers should never fail since it's called on
the shutdown path.
Use a local gate in stop_servers() to wait on all
background raft group server aborts.
Also, handle theoretical exceptions from server::abort()
to guarantee success.
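A simplified, synchronous sketch of the "never fail on shutdown" part. The `server` type and this `stop_servers` signature are hypothetical; the real function is asynchronous and waits on the background aborts through a gate, which this sketch does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical server handle: in this model abort() may throw.
struct server {
    std::function<void()> abort;
};

// Aborts every server and swallows any failure, so the shutdown path
// always completes. Returns how many aborts failed.
std::size_t stop_servers(std::vector<server>& servers) noexcept {
    std::size_t failed = 0;
    for (auto& s : servers) {
        try {
            s.abort();
        } catch (...) {
            ++failed;  // a real implementation would log the exception here
        }
    }
    servers.clear();
    return failed;
}
```

A throwing abort in the middle of the list does not prevent the remaining servers from being aborted.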
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Recent PR #10092 (propagating read timeouts on coordinator without
throwing) accidentally removed a line which cancelled
`abstract_read_resolver`'s `_timeout` timer after a read failure.
Because of that, it might happen that after a read failure the timer is
triggered and the `_done_promise` is set twice which triggers an assert
in seastar.
This commit brings back the line which cancels the timeout timer.
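The failure mode can be illustrated with a small single-threaded model. `read_resolver` below is a hypothetical stand-in, not `abstract_read_resolver` itself: a counter replaces the promise, and a boolean replaces the timer.

```cpp
#include <cassert>

// Hypothetical model of a resolver guarded by a timeout timer.
class read_resolver {
    bool _timer_armed = true;
    int _resolved = 0;  // how many times the result was set; must never exceed 1

    void resolve() { ++_resolved; }

public:
    // A read failure resolves the request and cancels the pending timeout.
    void on_failure() {
        _timer_armed = false;  // the cancellation that the fix restores
        resolve();
    }

    // Called when the timeout elapses; a cancelled timer does nothing.
    void on_timeout() {
        if (_timer_armed) {
            _timer_armed = false;
            resolve();
        }
    }

    int times_resolved() const { return _resolved; }
};
```

Without the cancellation in `on_failure()`, a later `on_timeout()` would resolve a second time, which corresponds to setting the promise twice.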
Fixes: #10193
Closes #10206
This is a translation of Cassandra's CQL unit test source file
validation/operations/BatchTest.java into our cql-pytest framework.
This test file includes 13 tests for various types of BATCH operations.
All tests pass on Scylla - no known or new bugs were reproduced.
Two of the tests involve very slow testing of TTLs, so after verifying
they work I marked them "skip" for now (we can always turn them on later,
perhaps after reducing the length or number of the sleeps).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220313121634.2611423-1-nyh@scylladb.com>
Commit 1c99ed6ced added tracing logs
about the index chosen for the query, but aggregate queries have
a separate code path, which wasn't taken into account.
After this patch, tracing for aggregate queries also includes
this additional information.
Closes #10195
interrupt() makes it sound like it's interrupting the compaction, but it's
actually called *on* interrupt, to handle the interrupt scenario.
Let's rename it to on_interrupt().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311000128.189840-1-raphaelsc@scylladb.com>
The change is mostly mechanical: update all compactor instances to the
_v2 variant and update all call-sites, of which there are not that many.
As a consequence of this patch, queries -- both single-partition and
range-scans -- now do the v2->v1 conversion in the consumers, instead of
in the compactor.
Add a `consume()` overload for range tombstone changes and convert them
internally to range tombstones, as the underlying reconcilable result
is still v1.
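A sketch of how a stream of range tombstone changes maps back to range tombstones. The model is deliberately simplified and the names are assumptions: positions are plain integers, a tombstone is just a timestamp, and timestamp 0 stands for "no tombstone". Scylla's actual conversion differs in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// v2-style element: from `position` onwards the active tombstone is `timestamp`.
struct range_tombstone_change {
    int position;
    std::int64_t timestamp;
};

// v1-style element: `timestamp` covers the half-open range [start, end).
struct range_tombstone {
    int start;
    int end;
    std::int64_t timestamp;
};

class change_to_range_converter {
    int _start = 0;
    std::int64_t _active = 0;  // 0 means no tombstone is currently open
    std::vector<range_tombstone> _out;

public:
    // Each change closes the currently open range (if any) and opens a new one.
    void consume(const range_tombstone_change& rtc) {
        if (_active != 0 && rtc.position > _start) {
            _out.push_back({_start, rtc.position, _active});
        }
        _start = rtc.position;
        _active = rtc.timestamp;
    }

    const std::vector<range_tombstone>& result() const { return _out; }
};
```

Feeding the changes (1, ts 10), (5, ts 20), (8, none) produces the ranges [1, 5) with ts 10 and [5, 8) with ts 20.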
Add a consume() overload which takes a range tombstone change and drops
it just like the existing range tombstone overload does: query results
don't care about range tombstones.
The downstream consumer (mutation_querier) already ignores range
tombstones, so no point forwarding them to it. This makes adding v2
support easier too as range tombstone changes can be similarly dropped.
Changing the capture list of a lambda in
forward_service::execute_on_this_shard from [&] to an explicit one
improves readability and prevents potential bugs.
Closes #10191
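A generic illustration of why an explicit capture list is safer, unrelated to the actual forward_service code (`make_labeler` is a hypothetical function): the lambda outlives the enclosing function, so a blanket `[&]` would leave it holding a dangling reference to `prefix`, while the explicit list shows that it owns a copy.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Builds a callback that outlives this function. The explicit capture list
// documents exactly what the lambda holds: its own copy of `prefix`.
std::function<std::string(int)> make_labeler(std::string prefix) {
    return [prefix = std::move(prefix)](int shard) {
        return prefix + std::to_string(shard);
    };
}
```

Calling `make_labeler("shard-")(3)` returns "shard-3"; with `[&]` the same call would read a destroyed string.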
The services' configuration should be performed with the help of
service-specific config that's filled by the service creator. This
is not the case for gossiper that grabs the db::config and keeps
reference on it throughout its lifetime.
This set brings the gossiper configuration to the described form
by putting the needed config bits onto gossip_config (that already
exists and is partially used for gossiper configuration). And two
live-updateable options need extra care.
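The pattern can be sketched as follows. All field names and the `global_config` type are illustrative assumptions, not the real db::config or gossip_config contents, and the live-updateable options are not modeled.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the global configuration.
struct global_config {
    std::string cluster_name = "test-cluster";
    int ring_delay_ms = 30000;
    int unrelated_option = 7;  // something the gossiper never needs
};

// Service-specific config: only the bits the service actually uses.
struct gossip_config {
    std::string cluster_name;
    int ring_delay_ms = 0;
};

class gossiper {
    gossip_config _cfg;  // owned copy, no reference to the global config

public:
    explicit gossiper(gossip_config cfg) : _cfg(std::move(cfg)) {}
    const gossip_config& config() const { return _cfg; }
};

// The service creator fills the service-specific config from the global one.
gossiper make_gossiper(const global_config& g) {
    gossip_config cfg;
    cfg.cluster_name = g.cluster_name;
    cfg.ring_delay_ms = g.ring_delay_ms;
    return gossiper(std::move(cfg));
}
```

Because the service keeps its own copy, it can be constructed in tests without any global configuration object.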
tests: unit(dev), dtest.simple_boot_shutdown(dev)
* 'br-gossiper-no-db-config' of https://github.com/xemul/scylla:
gossiper: Remove db::config reference from gossiper
gossiper: Keep live-updateable options on gossiper
gossiper: Keep immutable options on gossip_config
Although its API was long converted to v2, its implementation stayed v1
because the memtable and mutation API were still v1. Now that the
memtable flush returns a v2 reader we can have a second look at
converting this. While the mutation API still uses v1, this can easily
be worked around by going through `mutation_rebuilder_v2`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220302145945.189607-1-bdenes@scylladb.com>
The series overhauls the compaction_manager::task design and implementation
by properly layering the functionality between the compaction_manager
that deals with generic task execution, and the per-task business logic that is defined
in a set of classes derived from the generic task class.
While at it, the series introduces `task::state` and a set of helper functions to manage it
to prevent leaks in the statistics, fixing #9974.
Two more stats counters were exposed: `completed_tasks` and a new `postponed_tasks`.
Test: sstable_compaction_test
Dtest: compaction_test.py compaction_additional_test.py
Fixes #9974
Closes #10122
* github.com:scylladb/scylla:
compaction_manager: use coroutine::switch_to
compaction_manager::task: drop _compaction_running
compaction_manager: move per-type logic to derived task
compaction_manager: task: add state enum
compaction_manager: task: add maybe_retry
compaction_manager: reevaluate_postponed_compactions: mark as noexcept
compaction_manager: define derived task types
compaction_manager: register_metrics: expose postponed_compactions
compaction_manager: register_metrics: expose failed_compactions
compaction_manager: register_metrics: expose _stats.completed_tasks
compaction: add documentation for compaction_type to string conversions
compaction: expose to_string(compaction_type)
compaction_manager: task: standardize task description in log messages
compaction_manager: refactor can_proceed
compaction_manager: pass compaction_manager& to task ctor
compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
Saving an allocation for running the functor
as a task in the switched-to scheduling group.
Also, switch to the desired scheduling group at
the beginning of the task so that the higher level logic,
like getting the list of sstables to compact
will be performed under the desired scheduling group,
not only the compaction code itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace the _compaction_running boolean member
by calculating _state == state::active
now that setup_new_compaction switches state to
`active`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the business logic into the task specific classes.
Separating initialization during task construction,
from the compaction_done task, moved into
a do_run() method, and in some cases moving
a lambda function that was called per table (as in
rewrite_sstables) into a private method of the
derived class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add an enum class representing the task state machine
and a switch_state function to transition between the states
and update the corresponding compaction_manager stats counters.
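A minimal sketch of such a state machine. The set of states and the counter names are assumptions for illustration and may not match the real enum. The point is that counters change in exactly one place, so a task cannot leave a counter incremented when it changes state.

```cpp
#include <cassert>

// Hypothetical subset of task states.
enum class state { none, pending, active, done, postponed, failed };

struct stats {
    int pending_tasks = 0;
    int active_tasks = 0;
    int completed_tasks = 0;
    int postponed_tasks = 0;
};

class task {
    stats& _stats;
    state _state = state::none;

public:
    explicit task(stats& s) : _stats(s) {}

    // The only place where counters change: leave the old state, enter the new one.
    void switch_state(state new_state) {
        switch (_state) {
        case state::pending: --_stats.pending_tasks; break;
        case state::active: --_stats.active_tasks; break;
        case state::postponed: --_stats.postponed_tasks; break;
        default: break;
        }
        switch (new_state) {
        case state::pending: ++_stats.pending_tasks; break;
        case state::active: ++_stats.active_tasks; break;
        case state::postponed: ++_stats.postponed_tasks; break;
        case state::done: ++_stats.completed_tasks; break;
        default: break;
        }
        _state = new_state;
    }

    // Derived from the state instead of being tracked in a separate boolean.
    bool compaction_running() const { return _state == state::active; }
};
```

Moving a task through pending, active and done leaves the pending and active counters at zero and increments only the completed counter.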
Refs #9974
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reading data from sstables without compacting first puts
unnecessary pressure on the cache. The mutation streams
need to be resolved anyway before passing to subsequent
consumers, so it's better to do it as close to the
source as possible.
Fixes: #3568
Closes #10188
"
This patch-set converts the sstable writer to v2, then prepares the
ground for users actually being able to use the v2 variant. Finally it
converts all users to do so and then decommissions the v1 variant.
For users to be able to use the v2 writer API, we first have to add a v2
output to the compactor, as some users write to sstables via the
compactor.
Tests: unit(dev, release)
"
* 'sstable-writer-v2/v2' of https://github.com/denesb/scylla:
sstables/sstable: remove now unused v1 write_components() variant
mutation_compactor: remove now unused compact_for_compaction
test/boost/mutation_test: migrate to compact_for_mutation_v2
streaming: migrate to v2 variant of sstable writer API
memtable-sstable: migrate to v2 variant of sstable writer API
test: migrate to the v2 variant of the sstable writer API
sstables/sstable: expose v2 variant of write_components()
sstables: convert mx writer to v2
sstables/metadata_collector: use position_in_partition for min/max keys
test/boost/mutation_test: test_compactor_range_tombstone_spanning_many_pages extend to check v2 output too
mutation_reader: convert compacting reader v2
mutation_compactor: add v2 output
mutation_compactor: make _last_clustering_pos track last input
range_tombstone_change: add set_tombstone()
test/lib/mutation_source_test: log name of each run_mutation_source()
To be used in the next patch to generate
a string description from the compaction_type.
In theory, we could use compaction_name()
but the latter returns the compaction type
in all-upper case and that is very different from
what we print to the log today. The all-upper
strings are used for the api layer, e.g. to
stop tasks of a particular compaction type.
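The distinction between the two conversions can be sketched like this. The enum values and every string below are illustrative assumptions, not the actual names Scylla prints or accepts.

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of compaction types.
enum class compaction_type { compaction, cleanup, scrub };

// All-upper-case names, of the kind used by the API layer.
std::string compaction_name(compaction_type t) {
    switch (t) {
    case compaction_type::compaction: return "COMPACTION";
    case compaction_type::cleanup: return "CLEANUP";
    case compaction_type::scrub: return "SCRUB";
    }
    return "UNKNOWN";
}

// Log-friendly names, kept separate so log output stays unchanged.
std::string to_string(compaction_type t) {
    switch (t) {
    case compaction_type::compaction: return "Compact";
    case compaction_type::cleanup: return "Cleanup";
    case compaction_type::scrub: return "Scrub";
    }
    return "Unknown";
}
```

Keeping two functions lets the API strings and the log strings evolve independently.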
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>