Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
Fixes#10801
Refs #10793
(cherry picked from commit 0e78ad50ea)
Closes#10802
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
Fixes https://github.com/scylladb/scylla/issues/4116
Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop
v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset
v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=
v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.
Closes#10717
* github.com:scylladb/scylla:
table: Add test where compaction doesn't keep up with flush rate.
random_mutation_generator: Add option to specify ks_name and cf_name
table: Prevent creating unbounded number of sstables
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.
In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
drops the keyspace and shuts down before compaction finishes. Dropping
keyspace drops tables, and each dropped table is smp::count writes to
system.local table with flush after write, which creates tens of
thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
during repair resulted in excessive flushing of small sstables which
compaction couldn't keep up with.
In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
The series fixes a couple of crashes that were found during starting and
stopping Scylla with raft while doing ddl operations. Most of them
related to shutdown order between different components.
Also in scylla-dev gleb/group0-fixes-v1
CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/
* origin-dev/gleb/group0-fixes-v1:
migration manager: remove unused code
db/system_distributed_keyspace: do not announce empty schema
main: stop raft before the migration manager
storage_service: do not pass the raft group manager to storage_service constructor
main: destroy the group0_client after stopping the group0
An expr::constant is an expression that happens to represent a constant,
so it's too heavyweight to be used for evaluation. Right now the extra
weight is just a type (which causes extra work by having to maintain
the shared_ptr reference count), but it will grow in the future to include
source location (for error reporting) and maybe other things.
Prior to e9b6171b5 ("Merge 'cql3: expr: unify left-hand-side and
right-hand-side of binary_operator prepares' from Avi Kivity"), we had
to use expr::constant since there was not enough type infomation in
expressions. But now every expression carries its type (in programming
language terms, expressions are now statically typed), so carrying types
in values is not needed.
So change evaluate() to return cql3::raw_value. The majority of the
patch just changes that. The rest deals with some fallout:
- cql3::raw_value gains a view() helper to convert to a raw_value_view,
and is_null_or_unset() to match with expr::constant and reduce further
churn.
- some helpers that worked on expr::constant and now receive a
raw_value now need the type passed via an additional argument. The
type is computed from the expression by the caller.
- many type checks during expression evaluation were dropped. This is
a consequence of static typing - we must trust the expression prepare
phase to perform full type checking since values no longer carry type
information.
Closes#10797
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.
Fixes#10780Reopens#652.
active_memtable().empty() becomes true once seal_active_memtable
succeeds with _memtables->add_memtable(), not when it is able
to flush the (once active) memtable.
In contrast, min_memtable_timestamp() returns api::max_timestamp
only if there is no data in any memtable.
Fixes#10793
Backport notes:
- Introduced in f6d9d6175f (currently in
branch-5.0)
- backport requires also 0e78ad50ea
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10798
and LSIs' from Nadav Har'El
This series includes three small fixes (and of course, tests) for
various edge cases of GSI and LSI handling in Alternator:
1. We add the IndexArn that were missing in DescribeTable for indexes
(GSI and LSI)
2. We forbid the same name to be used for both GSI and LSI (allowing it
was a bug, not a feature)
3. We improve the error handling when trying to tag a GSI or LSI, which
is not currently allowed (it's also not allowed in DynamoDB).
Closes#10791
* github.com:scylladb/scylla:
alternator: improve error handling when trying to tag a GSI or LSI
alternator: forbid duplicate index (LSI and GSI) names
alternator: add ARN for indexes (LSI and GSI)
The left-hand-side of a binary_operator is currently evaluated via
a get_value() function that receives the row values. On the other hand,
the right hand side is evaluated via evaluate(), which receives query_options
in order to resolve bind variables.
This series unifies the two paths into evaluate(), and standardizes the different
inputs into a new evaluation_inputs struct. The old hacks column_value_eval_bag
and column_maybe_subscripted are removed.
Closes#10782
* github.com:scylladb/scylla:
cql3: expr: drop column_maybe_subscripted
cql3: expr: possible_lhs_values(): open-code get_value_comparator()
cql3: expr: rationalize lhs/rhs argument order
cql3: expr: don't rely on grammar when comparing tuples
cql3: expr: wire column_value and subscript to evaluate()
cql3: get_value(subscript): remove gratuitous pointer
cql3: expr: reindent get_value(subscript)
cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted)
cql3: raw_value: add missing conversion from managed_bytes_opt&&
cql3: prepare_expr: prepare subscript type
cql3: expr: drop internal 'column_value_eval_bag'
cql3: expr: change evalute() to accept evaluation_inputs
cql3: expr: make evaluate(<expression subtype>) static
cql3: expr: push is_satisfied_by regular and static column extraction to callers
cql3: expr: convert is_satisfied_by() signature to evaluation_inputs
cql3: expr: introduce evaluation_inputs
In issue #10786, we raised the idea of maybe allowing to tag (with
TagResource) GSIs and LSIs, not just base tables. However, currently,
neither DynamoDB nor Syclla allows it. So in this patch we add a
test that confirms this. And while at it, we fix Alternator to
return the same error message as DynamoDB in this case.
Refs #10786.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exists only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate name.
So in this patch we add a test for this issue, and fix it.
Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.
Fixes#10789
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These
look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base
table, and INDEXNAME is the name of the LSI or the GSI.
These ARNs should be returned by DescribeTable as part of its
description of each index, and this patch adds that missing IndexArn
field.
The ARN we're adding here is hardly useful (e.g., as explained in
issue #10786, it can't be used to add tags to the index table),
but nevertheless should exist for compatibility with DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Older versions of the Python Cassandra driver had a bug where a single empty
page aborts a scan.
The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately
tiny pages, so it turns out that some of them are empty, so the test breaks on buggy
versions of the driver, which cause the test to fail when run by developers who happen
to have old versions of the driver.
So in this small series we skip this test when running on a buggy version of the driver.
Fixes#10763Closes#10766
* github.com:scylladb/scylla:
test/cql-pytest: skip another test on older, buggy, drivers
test/cql-pytest: de-duplicate code checking for an old buggy driver
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.
There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
Changelog:
v3:
- explicitly call deregister instead of moving the weight RAII object to release weight
- mark compaction as finished when sstables are compacted, without waiting for history to update
v2:
- Split the patches differently for easier review
- Rebased agains newer master, which contains fixes that failed the debug version of the test
- Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717)
Closes#10507
* github.com:scylladb/scylla:
compaction: Release compaction weight before updating history.
compaction: Inline compact_sstables_and_update_history call.
compaction: Extract compact_sstables function
compaction: Rename compact_sstables to compact_sstables_and_update_history
compaction: Extract update_history function
compaction: Extract should_update_history function.
compaction: Fetch start_size from compaction_result
compaction: Add tracking start_size in compaction_result.
On 48b6aec16a we mistakenly allowed
check=True on systemd_unit.is_active(), it should be check=False.
We check unit's status by "systemctl is-active" output string,
it returns "active" or "inactive".
But systemctl command returns non-zero status when it returning
"inactive", so we are getting Exception here.
To fix this, we need new option "ignore_error=True" for out(),
and use it in systemd_unit.is_active().
Fixes#10455Closes#10467
After 93b765f655, our pull_github_pr.sh script tries to detect
a non-orthodox remote repo name, but it also adds an assumption
which breaks on some configurations (by some I mean mine).
Namely, the script tries to parse the repo name from the upstream
branch, assuming that current HEAD actually points to a branch,
which is not the way some users (by some I mean me) work with
remote repositories. Therefore, to make the script also work
with detached HEAD, it now has two fallback mechanisms:
1. If parsing @{upstream} failed, the script tries to parse
master@{upstream}, under the assumption that the master branch
was at least once used to track the remote repo.
2. If that fails, `origin/master` is used as last resort solution.
This patch allows some users (guess who) to get back to using
scripts/pull_github_pr.sh again without using a custom patched version.
Closes#10773
Clang up to version 13 supports the coroutines technical specification
(in std::experimental). 15 and above support standard coroutines (in
namespace std). Clang 14 supports both, but with a warning for the
technical specification coroutines.
To avoid the warning, change the threshold for selecting standard
coroutines from clang 15 to clang 14. This follow seastar commit
070ab101e2.
Closes#10647
column_maybe_subscripted is a variant<column_value*, subscript*> that
existed for two reasons:
1. evaluation of subscripts and of columns took different paths.
2. calculation of the type of column or column[sub] took different paths.
Now that all evaluations go through evaluate(), and the types are
present in the expression itself, there is no need for column_maybe_subscripted
and it is replaced with plain expressions.
Some functions accept the right-hand-side as the first argument
and the left-hand-side as the second argument. This is now confusing,
but at least safe-ish, as the arguments have different types. It's
going to become dangerous when we switch to expressions for both sides,
so let's rationalize it by always starting with lhs.
Some parameters were annotated with _lhs/_rhs when it was not clear.
The grammar only allows comparing tuples of clustering columns, which
are non-null, but let's not rely on that deep in expression evaluation
as it can be relaxed.
is_satisfied_by() used an internal column_value_eval_bag type that
was more awkwardly named (and more awkward to use due to more nesting)
than evaluation_inputs. Drop it and use evaluation_inputs throughout.
The thunk is_satisified_by(evaluation_inputs) that just called
is_satisified_by(column_value_eval_bag) is dropped.
Currently, evaluate() accepts only query_options, which makes
it not useful to evaluate columns. As a result some callers
(column_condition) have to call it directly on the right-hand-side
of binary expressions instead of evaluating the binary expression
itself.
Change it to accept evaluation_input as a parameter, but keep
the old signature too, since it is called from many places that
don't have rows.
is_satisfied_by() rearranges the static and regular columns from
query::result_row_view form (which is a use-once iterator) to
std::vector<managed_bytes_opt> (which uses the standard value
representation, and allows random access which expression
evaluation needs). Doing it in is_saitisfied_by() means that it is
done every time an expression is evaluated, which is wasteful. It's
also done even if the expression doesn't need it at all.
Push it out to callers, which already eliminates some calls.
We still pass cql3::expr::selection, which is a layering violation,
but that is left to another time.
Note that in view.cc's check_if_matches(), we should have been
able to move static_and_regular_columns calculation outside the
loop. However, we get crashes if we do. This is likely due to a
preexisting bug (which the zero iterations loop avoids). However,
in selection.cc, we are able to avoid the computation when the code
claims it is only handling partition keys or clustering keys.
Callers are converted, but the internals are kept using the old
conventions until more APIs are converted.
Although the new API allows passing no query_options, the view code
keeps passing dummy query_options and improvement is left as a FIXME.
An expression may refer to values provided externally: the partition
and clusterinng keys, the static and regular row (all providing
column values), and the query options (providing values for bind
variables). Currently, different evaluation functions
(evaluate(), get_value(), and is_satisfied_by()) receive different
subsets of these values.
As a first step towards unifying the various ways to evaluate an
expression, collect the parameters in a single structure. Since
different evaluation contexts have different subsets, make everything
optional (via a pointer). Note that callers are expected to verify
using the grammar or prepare phase that they don't refer to values
that are not provided.
The cql3::selection::selection parameter is provided to translate
from query::result_row_view to schema column indexes. This is pretty
bad since it means the translation needs to be done for every
evaluation and is therefore a candidate for removal, but is kept here
since that's how it's currently done.
Marking a test as flaky allows to keep running it in CI rather than disable it when it's discovered that a test is flaky.
Flaky tests, if they fail, show up as flaky in the output, but don't fail the CI.
```
kostja@hulk:~/work/scylla/scylla$ ./test.py cdc_with --repeat=30 --verbose
Found 30 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[1/30] cql debug [ FLKY ] cdc_with_lwt_test.2 9.36s
[2/30] cql debug [ FLKY ] cdc_with_lwt_test.1 9.53s
[3/30] cql debug [ PASS ] cdc_with_lwt_test.7 9.37s
[4/30] cql debug [ PASS ] cdc_with_lwt_test.8 9.41s
[5/30] cql debug [ PASS ] cdc_with_lwt_test.10 9.76s
[6/30] cql debug [ FLKY ] cdc_with_lwt_test.9 9.71s
```
Closes#10721
* github.com:scylladb/scylla:
test.py: add support for flaky tests
test.py: make Test hierarchy resettable
test.py: proper suite name in the log
test.py: shutdown cassandra-python connection before exit
The idea is that a flaky test can be marked as flaky
rather than disabled to make sure it passes in CI.
This reduces chances of a regression being added
while the flakiness is being resolved and the number
of disabled tests doesn't grow.
Introduce reset() hierarchy, which is similar to __init__(),
i.e. allows to reset test execution state before retrying it.
Useful for retrying flaky tests.
Use a nice suite name rather than an internal Python
object key in the log. Fixes a regression introduced
when addressing a style-related review remark.
Shutdown cassandra-python connections before exit, to avoid
warnings/exceptions at shutdown.
Cassandra-python runs a thread pool and if
connections are not shut down before exit, there could
be a warning that the thread pool is not destroyed
before exiting main.
CDC tables use a custom partitioner, which is not reflected in schema dumps (`CREATE TABLE ...`) and currently it is not possible to fix this properly, as we have no syntax to set the partitioner for a table. To work around this, the schema loader determines whether a table is a cdc table based on its name (does it end with `_scylla_cdc_table`) and sets the partitioner manually if it is the case.
Fixes: https://github.com/scylladb/scylla/issues/9840Closes#10774
* github.com:scylladb/scylla:
tools/schema_loader: add support for CDC tables
cdc/log.hh: expose is_log_name()
Change port type passed to Cassandra Python driver to int to avoid format errors in exceptions.
Manually shutdown connections to avoid reconnects after tests are done (required by upcoming async pytests).
Tests: (dev)
Closes#10722
* github.com:scylladb/scylla:
test.py: shutdown connection manually
test.py: fix port type passed to Cassandra driver
CDC tables use a custom partitioner, which is not reflected in schema
dumps (`CREATE TABLE ...`) and currently it is not possible to fix this
properly, as we have no syntax to set the partitioner for a table.
To work around this, the schema loader determines whether a table is a
cdc table based on its name (does it end with `_scylla_cdc_table`) and
sets the partitioner manually if it is the case.
Allow outside code to use it to determine whether a table is cdc or not.
This is currently the most reliable method if the custom partitioner is
not set on the schema of the investigated table.