This is mainly to make the next patch simpler. It also makes the backlog
controller API smaller by removing its sg() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only difference between the two is the way the backlog controller
is created. It's much simpler to have the controller construction logic
in the compaction manager instead. A similar "trick" is used to construct
the flush controller for the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This way it's more compact and easier to extend.
Also it's small enough to fix indentation right at once.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it a future-returning method and set up the _stop_future in its
only caller. This makes the next patch much simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch set
- adds logs before and after batchlog replay
- removes a duplicated call to trigger batchlog replay
- removes an obsolete log
Closes #10800
* github.com:scylladb/scylla:
storage_service: Remove obsolete log
storage_service: Do not call do_batch_log_replay again in unbootstrap
storage_service: Add log for start and stop of batchlog replay
A pre-scrub view snapshot cannot be attributed to user error, so there
is no call to bail out.
Closes #10760.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10783
* github.com:scylladb/scylla:
api-doc: correct spelling
allow pre-scrub snapshots of materialized views and secondary indices
When a memtable receives a tombstone, under some workloads the tombstone
can cover data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
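The eager-drop idea can be sketched in a few lines of Python (an illustrative model only, not Scylla's C++ memtable; all names here are invented):

```python
# Toy memtable: when a tombstone arrives whose timestamp covers data that
# is still in the memtable, drop that data immediately instead of keeping
# both the data and the tombstone until flush.
class Memtable:
    def __init__(self):
        self.rows = {}        # key -> (value, write_timestamp)
        self.tombstones = {}  # key -> deletion_timestamp

    def insert(self, key, value, ts):
        # A newer insert supersedes any older tombstone for the key.
        if self.tombstones.get(key, -1) < ts:
            self.rows[key] = (value, ts)
            self.tombstones.pop(key, None)

    def delete(self, key, ts):
        row = self.rows.get(key)
        if row is not None and row[1] <= ts:
            # Eager compaction: the tombstone covers in-memtable data,
            # so reclaim the memory right away rather than at flush time.
            del self.rows[key]
        self.tombstones[key] = ts

    def live_count(self):
        return len(self.rows)
```

For a raft-log-like insert-then-delete pattern, the live set stays empty, so far less data reaches the flush path.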
Fixes #652.
Closes #10807
* github.com:scylladb/scylla:
memtable: Add counters for tombstone compaction
memtable, cache: Eagerly compact data with tombstones
memtable: Subtract from flushed memory when cleaning
mvcc: Introduce apply_resume to hold state for partition version merging
test: mutation: Compare against compacted mutations
compacting_reader: Drop irrelevant tombstones
mutation_partition: Extract deletable_row::compact_and_expire()
mvcc: Apply mutations in memtable with preemption enabled
test: memtable: Make failed_flush_prevents_writes() immune to background merging
This patch adds an extensive array of tests for the Cassandra feature
that Scylla hasn't implemented yet (issues #2962, #8745, #10707) of
indexing the keys, values or entries of a collection column.
The goal of these tests is to explicitly exercise every corner case
I could think of by looking at the documentation of this feature and
considering its possible implementation - and as usual, making sure
that the tests actually pass on Cassandra.
These tests overlap some of the existing unit tests that we translated
from Cassandra, as well as some randomized tests that do not necessarily
cover the same edge cases as these tests cover.
All tests added in this patch pass on Cassandra, but currently fail
on Scylla due to the above issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #10771
The most time-consuming part is invoking "ninja -t compdb", and there
is no need to repeat that for every mode.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes #10733
`sstable_directory::process_sstables_dir` may hit an exception when calling `handle_component`.
In this case we currently destroy the `sstable_dir_lister` variable without closing the `directory_lister` first -
leading to terminate in `~directory_lister` as seen in #10697.
This mini-series handles this exception and always closes the `directory_lister`.
Add unit test to reproduce this issue.
Fixes #10697
Closes #10754
* github.com:scylladb/scylla:
sstable_directory: process_sstable_dir: fixup indentation
sstable_directory: process_sstable_dir: close directory_lister on error
Before the change, the test artificially set the soft pressure
condition hoping that the background flusher will flush the
memtable. That won't happen if, by the time the background flusher runs,
the LSA region is updated and the soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
Fixes #10801
Refs #10793
(cherry picked from commit 0e78ad50ea)
Closes #10802
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
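The wait loop described above (and refined in the v5 changelog below) can be sketched with a condition variable; this is a hypothetical Python model, not Scylla's Seastar-based code, and names like MAX_SSTABLES_FOR_FLUSH are invented:

```python
import threading

# Illustrative threshold; the series uses
# max(schema()->max_compaction_threshold(), 32).
MAX_SSTABLES_FOR_FLUSH = 32

class Table:
    def __init__(self):
        self._sstables = 0
        self._cv = threading.Condition()

    def _trigger_compaction(self):
        # Kick compaction before waiting, so the wait can make progress.
        pass

    def flush(self):
        with self._cv:
            if self._sstables > MAX_SSTABLES_FOR_FLUSH:
                self._trigger_compaction()
            # Predicate-based wait correctly handles spurious wakeups.
            self._cv.wait_for(
                lambda: self._sstables <= MAX_SSTABLES_FOR_FLUSH)
            self._sstables += 1  # the flush produces one new sstable

    def on_sstable_set_refresh(self, new_count):
        # Signaled whenever compaction or truncate shrinks the sstable set.
        with self._cv:
            self._sstables = new_count
            self._cv.notify_all()
```

The flush (producer) blocks until compaction or truncation (consumer) reduces the sstable count below the threshold.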
Fixes https://github.com/scylladb/scylla/issues/4116
Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop
v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset
v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=
v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.
Closes #10717
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
random_mutation_generator: Add option to specify ks_name and cf_name
table: Prevent creating unbounded number of sstables
Otherwise, if we don't consume all lister's entries,
~directory_lister terminates since the
directory_lister is destroyed without being closed.
Add unit test to reproduce this issue.
Fixes#10697
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a memtable receives a tombstone, under some workloads the tombstone
can cover data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.
One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.
Fixes #652.
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards against this.
This will start happening after tombstones are compacted with rows on
partition version merging.
This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
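The accounting flow can be illustrated with a toy model (invented names; the real guard is the assert in ~flush_memory_accounter):

```python
class FlushAccounter:
    def __init__(self, total_dirty):
        # Dirty bytes the in-flight flush is still expected to release.
        self.remaining = total_dirty

    def account_flushed(self, n):
        self.remaining -= n
        # Mirrors the ~flush_memory_accounter guard: never go negative.
        assert self.remaining >= 0

class Memtable:
    def __init__(self, dirty_bytes):
        self.accounter = FlushAccounter(dirty_bytes)

    def on_cleaner_released(self, n):
        # Callback from partition version merging: data it erased will
        # never be seen by the flush reader, so shrink the expected total
        # now instead of underflowing later.
        self.accounter.remaining -= n
```

Without the callback, the cleaner's erasure plus the flush reader's subtraction would double-count and trip the assert.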
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purely within versions. Consider applying a partition tombstone over a
large number of rows.
This patch introduces an apply_resume object to hold the necessary state
to ensure forward progress in case of preemption.
No change in behavior yet.
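A toy sketch of the approach (Python, invented names): the resume state is an explicit object handed back to the merge, rather than something encoded in the versions themselves:

```python
class ApplyResume:
    """Holds merge progress across preemption points."""
    def __init__(self):
        self.next_index = 0  # next source-version element to move

def merge_step(src_rows, dst_rows, resume, budget):
    # Move at most `budget` elements, then yield to the caller.
    # Returns True once the source version is fully merged.
    while resume.next_index < len(src_rows) and budget > 0:
        dst_rows.append(src_rows[resume.next_index])
        resume.next_index += 1
        budget -= 1
    return resume.next_index == len(src_rows)
```

Because the cursor lives outside the versions, each resumption continues exactly where the previous step stopped, guaranteeing forward progress.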
Memtables and cache will compact eagerly, so tests should not expect
readers to produce the exact mutations written, only mutations which are
equivalent after applying compaction.
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.
Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
Prerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, the apply path to the memtable was not
preemptible.
Because merging can now be deferred, we need to involve snapshots to
kick off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
Before the change, the test artificially set the soft pressure
condition hoping that the background flusher will flush the
memtable. That won't happen if, by the time the background flusher runs,
the LSA region is updated and the soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
The series fixes a couple of crashes that were found while starting and
stopping Scylla with Raft while doing DDL operations. Most of them are
related to shutdown order between different components.
Also in scylla-dev gleb/group0-fixes-v1
CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/
* origin-dev/gleb/group0-fixes-v1:
migration manager: remove unused code
db/system_distributed_keyspace: do not announce empty schema
main: stop raft before the migration manager
storage_service: do not pass the raft group manager to storage_service constructor
main: destroy the group0_client after stopping the group0
Previously, any attempt to take a materialized view or secondary index
snapshot was considered a mistake and caused the snapshot operation to
abort, with a suggestion to snapshot the base table instead.
But an automatic pre-scrub snapshot of a view cannot be attributed to
user error, so the operation should not be aborted in that case.
(It is an open question whether the more correct thing to do during
pre-scrub snapshot would be to silently ignore views. Or perhaps they
should be ignored in all cases except when the user explicitly asks to
snapshot them, by name)
Closes #10760.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
An expr::constant is an expression that happens to represent a constant,
so it's too heavyweight to be used for evaluation. Right now the extra
weight is just a type (which causes extra work by having to maintain
the shared_ptr reference count), but it will grow in the future to include
source location (for error reporting) and maybe other things.
Prior to e9b6171b5 ("Merge 'cql3: expr: unify left-hand-side and
right-hand-side of binary_operator prepares' from Avi Kivity"), we had
to use expr::constant since there was not enough type information in
expressions. But now every expression carries its type (in programming
language terms, expressions are now statically typed), so carrying types
in values is not needed.
So change evaluate() to return cql3::raw_value. The majority of the
patch just changes that. The rest deals with some fallout:
- cql3::raw_value gains a view() helper to convert to a raw_value_view,
and is_null_or_unset() to match with expr::constant and reduce further
churn.
- some helpers that worked on expr::constant and now receive a
raw_value now need the type passed via an additional argument. The
type is computed from the expression by the caller.
- many type checks during expression evaluation were dropped. This is
a consequence of static typing - we must trust the expression prepare
phase to perform full type checking since values no longer carry type
information.
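The shape of the change can be illustrated with a toy model (Python; not the actual cql3 types): the type is a property of the statically typed expression, so evaluation returns a bare value and callers that need the type read it from the expression:

```python
from dataclasses import dataclass

@dataclass
class Expression:
    type: str       # fixed at prepare time (expressions are statically typed)
    literal: object

def evaluate(expr):
    # Returns just the raw value, with no type attached.
    return expr.literal

def serialize(value, type_):
    # Helpers that used to take a typed constant now take the type as an
    # extra argument, computed from the expression by the caller.
    return f"{type_}:{value}"
```

Usage: `serialize(evaluate(e), e.type)` replaces passing a single typed-constant object through both steps.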
Closes #10797
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.
Fixes #10780
Reopens #652.
active_memtable().empty() becomes true once seal_active_memtable
succeeds with _memtables->add_memtable(), not when it is able
to flush the (once active) memtable.
In contrast, min_memtable_timestamp() returns api::max_timestamp
only if there is no data in any memtable.
Fixes #10793
Backport notes:
- Introduced in f6d9d6175f (currently in
branch-5.0)
- backport requires also 0e78ad50ea
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10798
and LSIs' from Nadav Har'El
This series includes three small fixes (and of course, tests) for
various edge cases of GSI and LSI handling in Alternator:
1. We add the IndexArn that was missing in DescribeTable for indexes
(GSI and LSI)
2. We forbid the same name to be used for both GSI and LSI (allowing it
was a bug, not a feature)
3. We improve the error handling when trying to tag a GSI or LSI, which
is not currently allowed (it's also not allowed in DynamoDB).
Closes #10791
* github.com:scylladb/scylla:
alternator: improve error handling when trying to tag a GSI or LSI
alternator: forbid duplicate index (LSI and GSI) names
alternator: add ARN for indexes (LSI and GSI)
The left-hand-side of a binary_operator is currently evaluated via
a get_value() function that receives the row values. On the other hand,
the right hand side is evaluated via evaluate(), which receives query_options
in order to resolve bind variables.
This series unifies the two paths into evaluate(), and standardizes the different
inputs into a new evaluation_inputs struct. The old hacks column_value_eval_bag
and column_maybe_subscripted are removed.
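A sketch of the unified shape (Python; invented field names standing in for the real evaluation_inputs struct):

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationInputs:
    row_values: dict = field(default_factory=dict)   # column -> cell value
    bind_values: list = field(default_factory=list)  # bind-variable values

def evaluate(expr, inputs):
    # Both sides of a binary_operator now go through this one entry point,
    # instead of get_value() for the LHS and evaluate() for the RHS.
    kind, arg = expr
    if kind == "column":
        return inputs.row_values[arg]
    if kind == "bind":
        return inputs.bind_values[arg]
    raise ValueError(f"unknown expression kind: {kind}")
```

Bundling the row data and the query options into one struct is what lets the two previously separate code paths share a single signature.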
Closes #10782
* github.com:scylladb/scylla:
cql3: expr: drop column_maybe_subscripted
cql3: expr: possible_lhs_values(): open-code get_value_comparator()
cql3: expr: rationalize lhs/rhs argument order
cql3: expr: don't rely on grammar when comparing tuples
cql3: expr: wire column_value and subscript to evaluate()
cql3: get_value(subscript): remove gratuitous pointer
cql3: expr: reindent get_value(subscript)
cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted)
cql3: raw_value: add missing conversion from managed_bytes_opt&&
cql3: prepare_expr: prepare subscript type
cql3: expr: drop internal 'column_value_eval_bag'
cql3: expr: change evaluate() to accept evaluation_inputs
cql3: expr: make evaluate(<expression subtype>) static
cql3: expr: push is_satisfied_by regular and static column extraction to callers
cql3: expr: convert is_satisfied_by() signature to evaluation_inputs
cql3: expr: introduce evaluation_inputs
In issue #10786, we raised the idea of maybe allowing to tag (with
TagResource) GSIs and LSIs, not just base tables. However, currently,
neither DynamoDB nor Scylla allows it. So in this patch we add a
test that confirms this. And while at it, we fix Alternator to
return the same error message as DynamoDB in this case.
Refs #10786.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exist, only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate names.
So in this patch we add a test for this issue, and fix it.
Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.
Fixes #10789
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These
look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base
table, and INDEXNAME is the name of the LSI or the GSI.
These ARNs should be returned by DescribeTable as part of its
description of each index, and this patch adds that missing IndexArn
field.
The ARN we're adding here is hardly useful (e.g., as explained in
issue #10786, it can't be used to add tags to the index table),
but nevertheless should exist for compatibility with DynamoDB.
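The format amounts to a one-line helper (a sketch, not Alternator's actual code):

```python
def index_arn(base_arn: str, index_name: str) -> str:
    # DynamoDB-style index ARN: BASEARN/index/INDEXNAME
    return f"{base_arn}/index/{index_name}"
```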
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Older versions of the Python Cassandra driver had a bug where a single empty
page aborts a scan.
The test test_secondary_index.py::test_filter_and_limit uses filtering and
deliberately tiny pages, so some of the pages turn out to be empty. The
test therefore breaks on buggy versions of the driver, and fails when run
by developers who happen to have an old version of the driver installed.
So in this small series we skip this test when running on a buggy version of the driver.
Fixes #10763
Closes #10766
* github.com:scylladb/scylla:
test/cql-pytest: skip another test on older, buggy, drivers
test/cql-pytest: de-duplicate code checking for an old buggy driver
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.
There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
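The reordering can be sketched like this (a Python toy model; the "weight" bookkeeping and helper names are invented):

```python
class Manager:
    def __init__(self):
        self.events = []          # records the order of operations
        self.active_weights = set()

    def register_weight(self, job):
        self.active_weights.add(job)
        return job

    def deregister_weight(self, weight):
        # Explicit deregister (changelog v3) instead of letting an RAII
        # object hold the weight until after the history update.
        self.active_weights.discard(weight)
        self.events.append(("released", weight))

def compact(manager, job, update_history):
    weight = manager.register_weight(job)
    try:
        result = job + "-compacted"   # stand-in for the actual compaction
    finally:
        # Release before the (possibly slow, cross-shard) history update,
        # so the next compaction doesn't have to wait for it.
        manager.deregister_weight(weight)
    update_history(result)
    return result
```

The point is purely ordering: the weight is released as soon as the sstables are compacted, and only then is the history written.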
Changelog:
v3:
- explicitly call deregister instead of moving the weight RAII object to release weight
- mark compaction as finished when sstables are compacted, without waiting for history to update
v2:
- Split the patches differently for easier review
- Rebased against newer master, which contains fixes for failures in the debug version of the test
- Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717)
Closes #10507
* github.com:scylladb/scylla:
compaction: Release compaction weight before updating history.
compaction: Inline compact_sstables_and_update_history call.
compaction: Extract compact_sstables function
compaction: Rename compact_sstables to compact_sstables_and_update_history
compaction: Extract update_history function
compaction: Extract should_update_history function.
compaction: Fetch start_size from compaction_result
compaction: Add tracking start_size in compaction_result.
In 48b6aec16a we mistakenly allowed
check=True on systemd_unit.is_active(); it should be check=False.
We check the unit's status by the "systemctl is-active" output string,
which is either "active" or "inactive".
But the systemctl command returns a non-zero exit status when it reports
"inactive", so we get an exception here.
To fix this, we add a new option, ignore_error=True, for out(),
and use it in systemd_unit.is_active().
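The fix can be sketched as a stand-alone Python analogue (out() and ignore_error are the patch's names; this helper is invented for illustration):

```python
import subprocess

def is_active(unit: str, runner=subprocess.run) -> bool:
    # check=False: "systemctl is-active" exits non-zero for "inactive",
    # which is a valid answer, not an error we should raise on.
    proc = runner(["systemctl", "is-active", unit],
                  capture_output=True, text=True, check=False)
    return proc.stdout.strip() == "active"
```

The decision is made from the output string alone, so the non-zero exit status for "inactive" no longer raises.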
Fixes #10455
Closes #10467