I intentionally store lambdas in variables and pass them to
with_scheduling_group using std::ref. Coroutines don't put variables
captured by lambdas on stack frame. If the lambda containing them is not
stored, the captured variables will be lost, resulting in stack/heap use
after free errors. An alternative is to capture variables, then create
local variables inside lambda bodies that contain a copy/moved version
of the captured ones. For example, if the post_flush lambda wasn't
stored in a dedicated variable, then it wouldn't be put on the coroutine
frame. At the first co_await inside of it, the lambda object along with
variables captured by it (old and &newtabs created inside square
brackets) would go away. The underlying objects (e.g. newtabs created in
the outer scope) would still be valid, but the reference to it would be
gone, causing most of the tests to fail.
Message-Id: <20211118131441.215628-2-mikolaj.sieluzycki@scylladb.com>
Add schema parameter so that:
* Caller has better control over schema -- especially relevant for
reverse reads where it is not possible to follow the convention of
passing the query schema which is reversed compared to that of the
mutations.
* Now that we don't depend on the mutations for the schema, we can lift
the restriction on mutations not being empty: this leads to safer
code. When the mutations parameter is empty, an empty reader is
created.
Add "make_" prefix to follow convention of similar reader factory
functions.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>
Symmetrically to virtual reader one, add the virtual writer
callback on a table that will be in charge of applying the
provided mutation.
If a virtual table doesn't override this apply method the
dedicated exception is thrown. Next patch will catch it and
propagate back to caller, so it's a new exception type, not
existing/std one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main motivation is to have future returning apply (to be used
by next patches). As a side effect -- indentation fix and private
dirty_memory_region_group() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
table_state is being introduced for compaction subsystem, to remove table dependency
from compaction interface, fix layer violations, and also make unit testing
easier as table_state is an abstraction that can be implemented even with no
actual table backing it.
In this series, compaction strategy interfaces are switching to table_state,
and eventually, we'll make compact_sstables() switch to it too. The idea is
that no compaction code will directly reference a table object, but only work
with the abstraction instead. So compaction subdirectory can stop
including database.hh altogether, which is a great step forward.
"
* 'table_state_v5' of https://github.com/raphaelsc/scylla:
sstable_compaction_test: switch to table_state
compaction: stop including database.hh for compaction_strategy
compaction: switch to table_state in estimated_pending_compactions()
compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
DTCS: reduce table dependency for task estimation
LCS: reduce table dependency for task estimation
table: Implement table_state
compaction: make table param of get_fully_expired_sstables() const
compaction_manager: make table param of has_table_ongoing_compaction() const
Introduce table_state
This is the first implementation of table_state, intended to be used
within compaction. It contains everything needed for compaction
strategies. Subsequently, compaction strategy interface will replace
table by table_state, and later all compaction procedures.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's intended to fix a bad layer violation as table was given the
responsibility of disabling compaction for a given table T, but that
logic clearly belongs to compaction_manager instead.
Additionally, gate will be used instead of counter, as former provides
manager with a way to synchronize with functions running under
run_with_compaction_disabled. so remove() can wait for their
termination.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
for symmetry, let's call it perform_* as it doesn't work like submission
functions which doesn't wait for result, like the one for minor
compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If memtable flush is segregated into multiple files, partition
estimation becomes innacurate and consequently bloom filters are
bigger than needed, leading to an increase in memory consumption.
To fix this, let's wire adjust_partition_estimate() into the flush
procedure, such that original estimation will be adjusted if
segregation is going to be performed. That's done by feeding
mutation_source_metadata, which will leave original estimation
unchanged if no segregation is needed, but will adjust it
otherwise.
Fixes#9581.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103141600.65806-2-raphaelsc@scylladb.com>
Without tweaking interface, there was no way to adjust estimated
partitions on flush. For example, when segregating a memtable for
TWCS, all produced sstables would have an estimation equal to
the memtable size, even though each only contains a subset of it,
which leads to a significant increase in memory consumption for
bloom filters. Subsequent work will use this interface to perform
the adjustment.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103141600.65806-1-raphaelsc@scylladb.com>
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.
The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.
We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.
We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.
We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.
The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.
Details in commit messages. Read the commits in topological order
for best review experience.
Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)
Closes#9281
* github.com:scylladb/scylla:
table: add option to automatically bypass cache for reversed queries
test: reverse sstable reader with random schema and random mutations
sstables: mx: implement reversed single-partition reads
sstables: mx: introduce partition_reversing_data_source
sstables: index_reader: add support for iterating over clustering ranges in reverse
clustering_key_filter: clustering_key_filter_ranges owning constructor
flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
clustering_key_filter: document clustering_key_filter_ranges::get_ranges
Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.
If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There are now 231 translation units that indirectly include commitlog.hh
due to the need to have access to db::commitlog::force_sync.
Move that type to a new file commitlog_types.hh and make it available
without access to the commitlog class.
This reduces the number of translation units that depend on commitlog.hh
to 84, improving compile time.
Currently no mutation-source supports reading in reverse natively but
we are working on changing that, adding native reverse read support to
memtable, cache and sstable readers. To ensure that all mutation
sources work in a correct and uniform manner when reading in reverse,
we add a reverse test to the mutation source test suite. This test
reverses the data that it passes to `populate()`, then reads in
forward order (in reverse compared to the data order). For this we use
the currently established reverse read API: reverse schema (schema
order == query order) and half-reversed (legacy) slice. All mutation
sources are prepared to work with reversed reads, using the
`make_reversing_reader()` adapter. As we progress with our native
reverse support, we will replace these adapters with native reversing
support. As part of this, we push down the reversing reader adapter
currently existing on the `query::consume_page()` level, to the
individual mutation sources.
Closes#9384
* github.com:scylladb/scylla:
test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set
querier: consume_page(): remove now unused max_size parameter
test/lib: mutation_source_test: test reading in reverse
test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
treewide: move reversing to the mutation sources
mutation_query: reconcilable_result_builder: document reverse query preconditions
sstable_set: time_series_sstable_set: reverse mode
mutlishard_mutation_query: set max result size on used permits
db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema
flat_mutation_reader: make_reversing_reader(): add convenience stored slice
mutation_reader: evictable_reader: add reverse read support
flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support
flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support
flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions
query-request: introduce `half_reverse_slice`
flat_mutation_reader_assertions: log what's expected
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes#9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The generic backlog formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes#9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The generic back formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The semaphore inside was never accessed and `max_memory_for_unlimited_query`
was always equal to `*cmd.max_result_size` so the parameter was completely
redundant.
`cmd.max_result_size` is supposed to be always set in the affected
functions - which are executed on the replica side - as soon as the
replica receives the `read_command` object, in case the parameter was
not set by the coordinator. However, we don't have a guarantee at the
type level (it's still an `optional`). Many places used
`*cmd.max_result_size` without even an assertion.
We make the code a bit safer, we check for `cmd.max_result_size` and if
it's indeed engaged, store it in `reader_permit`. We then access it from
`reader_permit` where necessary. If `cmd.max_result_size` is not set, we
assume this is an unlimited query and obtain the limit from
`get_unlimited_query_max_result_size`.
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass along them a reversed schema and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
The reader is used by load and stream to read sstables from the upload
directory which are not guaranteed to belong to the local shard.
Using the make_range_sstable_reader instead of
make_local_shard_sstable_reader.
Tests:
backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test
Fixes#9173Closes#9185
The table::run_compaction is a trivial wrapper for
table::compact_sstables.
We have lots of similar {start, trigger, run}_compaction functions.
Dropping the run_compaction wrapper to reduce confusion.
Closes#9161
NOTE: this series depends on a Seastar submodule update, currently queued in next: 0ed35c6af052ab291a69af98b5c13e023470cba3
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
1. `make_exception_future<>` for futures
2. `co_return coroutine::exception(...)` for coroutines
which return `future<T>` (the mechanism does not work for `future<>`
without parameters, unfortunately)
Tests: unit(release)
Closes#9079
* github.com:scylladb/scylla:
system_keyspace: pass exceptions without throwing
sstables: pass exceptions without throwing
storage_proxy: pass exceptions without throwing
multishard_mutation_query: pass exceptions without throwing
client_state: pass exceptions without throwing
flat_mutation_reader: pass exceptions without throwing
table: pass exceptions without throwing
commitlog: pass exceptions without throwing
compaction: pass exceptions without throwing
database: pass exceptions without throwing
Continuing the work from e4eb7df1a1, let's guarantee
serialization of sstable set updates by making all sites acquire
the mutation permit. Then table no longer rely on serialization
mechanism of row cache's update functions.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210728174740.78826-1-raphaelsc@scylladb.com>
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
1. make_exception_future<> for futures
2. co_return coroutine::exception(...) for coroutines
which return future<T> (the mechanism does not work for future<>
without parameters, unfortunately)
Today, table relies on row_cache::invalidate() serialization for
concurrent sstable list updates to produce correct results.
That's very error prone because table is relying on an implementation
detail of invalidate() to get things right.
Instead, let's make table itself take care of serialization on
concurrent updates.
To achieve that, sstable_list_builder is introduced. Only one
builder can be alive for a given table, so serialization is guaranteed
as long as the builder is kept alive throughout the update procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210721001716.210281-1-raphaelsc@scylladb.com>