Fixes#10489
Killing the CDC log table on CDC disable is unhelpful in many ways,
partly because it can cause random exceptions on nodes trying to
do a CDC-enabled write at the same time as log table is dropped,
but also because it makes it impossible to collect data generated
before CDC was turned off, but which is not yet consumed.
Since data should be TTL:ed anyway, retaining the table should not
really add any overhead beyond the compaction to eventually clear
it. And user did set TTL=0 (disabled), then he is already responsible
for clearing out the data.
This also has the nice feature of meshing with the alternator streams
semantics.
Closes#10601
The change
- adds a test which exposes a problem of a peculiar setup of
tombstones that trigger a mutation fragment stream validation exception
- fixes the problem
Applying tombstones in the order:
range_tombstone_change pos(ck1), after_all_prefixed, tombstone_timestamp=1
range_tombstone_change pos(ck2), before_all_prefixed, tombstone=NONE
range_tombstone_change pos(NONE), after_all_prefixed, tombstone=NONE
Leads to swapping the order of mutations when written and read from
disk via sstable writer. This is caused by conversion of
range_tombstone_change (in memory representation) to range tombstone
marker (on disk representation) and back.
When this mutation stream is written to disk, the range tombstone
markers type is calculated based on the relationship between
range_tombstone_changes. The RTC series as above produces markers
(start, end, start). When the last marker is loaded from disk, it's kind
gets incorrectly loaded as before_all_prefixed instead of
after_all_prefixed. This leads to incorrect order of mutations.
The solution is to skip writing a new range_tombstone_change with empty
tombstone if the last range_tombstone_change already has empty
tombstone. This is redundant information and can be safely removed,
while the logic of encoding RTCs as markers doesn't handle such
redundancy well.
Closes#10643
I noticed that `column_condition` (used in LWT `IF` clause) supports lists.
As part of the Grand Expression Unification we'll need to migrate that to
expressions, so we'll need to support list subscripts.
Use the opportunity to relax the normal filtering to allow filtering on
list subscripts: `WHERE my_list[:index] = :value`.
Closes#10645
* github.com:scylladb/scylla:
test: cql-pytest: add test for list subscript filtering
doc: document list subscripts usable in WHERE clause
cql3: expr: drop restrictions on list subscripts
cql3: expr: prepare_expr: support subscripted lists
cql3: expressions: reindent get_value()
cql3: expression: evaluate() support subscripting lists
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurences (i.e. caller is a coroutine).
One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.
Closes#10699
This two-patch series makes two improvements to configure.py:
The first patch fixes, yet again, issue #4706 where interrupting ninja's rebuild of build.ninja can leave it without any build.ninja at all. The patch uses a different approach from the previous pull-request #10671 that aimed to solve the same problem.
The second patch makes the output of configure.py more reproducible, not resulting in a different random order every time. This is useful especially when debugging configure.py and wanting to check if anything changed in its output.
Closes#10696
* github.com:scylladb/scylla:
configure.py: make build.ninja the same every time
configure.py: don't delete build.ninja when rebuild is interrupted
* seastar 96bb3a1b8...2be9677d6 (37):
> Merge 'stream_range_as_array: always close output stream' from Benny Halevy
Fixes#10592
> net/api: add "server_socket::is_listening()"
> src/net/proxy: remove unused variable
> coroutine: parallel_for_each: relax contraints
> native-stack: do not use 0 as ip address if !_dhcp
> coroutine: fix a typo in comment
> std-coroutine: include for LLVM-14
> tutorial: use non-variadic version of when_all_succeed()
> scripts: Fix build.sh to use new --c++-standard config option
> core/thread: initialize work::pr and work::th explicitly
> util/log-impl: remove "const" qualifier in return type
> map_reduce: remove redundant move() in return statement
> util: mark unused parameter with [[maybe_unused]]
> drop unused parameters
> build: use "20" for the default CMAKE_CXX_STANDARD
> build: make CMAKE_CXX_STANDARD a string
> utils: log: don't crash on allocation failure while extending log buffer
> tests: unix_domain_test: fix thread/future confusion in client_round()
> compat: do not use std::source_location if it is broken
> build: use CMAKE_CXX_STANDARD instead of Seastar_CXX_DIALECT
> Merge 'Add hello-world demo from tutorial' from Pavel
> rpc_tester: Put client/server sides into correct sched groups
> reactor_backend: Use _r reference, not engine() method
> future.hh: #include std-compat.hh for SEASTAR_COROUTINES_ENABLED
> Merge "Add more CPU-hog facilities to RPC-tester" from Pavel E
> Merge "io: Enlighten queued_request" from Pavel E
> Correct swapped AIO detection/setup calls
> sharded: De-duplicate map-reduce overloads
> file: don't trample on xfs flags when setting xfs size hint
> Merge "Per-class IO bandwidth limits" from Pavel E
> Merge 'sstring: fix format and optimize the performance of sstring::find().' from Jianyong Chen
> reactor_backend: Mark reactor_backend_aio::poll() private
> scripts/build.sh: Mind if not running on a terminal
> test, rpc: Don't work with large buffers
> test, futures: Don't expect ready future to resolve immediately
> source_location compatibility: Fix an unused private field error when treat warning as errors
> file: Remove try-catch around noexcept calls
Currently it works, but the newer version of seastar's map_reduce()
is compiled in a way to trigger use-after-free on accessing captured
value.
tests: unit(dev), unit.alternator(debug on v1)
Fixes#10689
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220523095409.6078-1-xemul@scylladb.com>
When logging a failed checksum on a compressed chunk.
Currently, only the offset is logged, but the index of the chunk whose
checksum failed to validate is also interesting.
Closes#10693
In several places, configure.py uses unsorted sets which results in
its output being in different order every time - both a different
order of targets, and a different order in dependencies of each
target.
This is both strange, and annoying when trying to debug configure.py
and trying to understand when, if at all, its output changes.
So in this patch, we use "sorted(...)" in the right places that
are needed to guarantee a fixed order. This fixed order is alphabetical,
but that's not the goal of this patch - the goal is to ensure a fixed
order.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In commit 9cc9facbea, I fixed issue #4706.
That issue about what happens when interrupting a rebuild of build.ninja
(which happens automatically when you run "ninja" after configure.py
changed). We don't want to leave behind a half-built build.ninja,
or leave it deleted.
The solution in that commit was for configure.py to build a temporary file
(build.ninja.tmp), and only as the very last step rename it build.ninja.
Unfortunately, since that time, we added more last steps after what
used to be that very last step :-(
If this new code running after the rename takes a noticable amount of
time, and if the user is unlucky enough to interrupt it during that
time, ninja will see a modified output file (build.ninja) and a failed
rule, and will delete the output file!
The solution is to move the rename out of configure.py. Instead, we
add a "--out=filename" option to configure.py which allows it to write
directly to a different file name, not build.ninja. When rebuilding
build.ninja, the rule will now call configure.py with "--out=build.ninja.new"
and then rename it back to build.ninja. Any failure or interrupt at any
stage of configure.py will leave build.ninja untouched, so ninja will
not delete it - it will just delete the temporary build.ninja.new.
Fixes#4706 (again)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.
The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.
The PR also includes fixes for some liveness problems stumbled upon
during testing.
Closes#10640
* github.com:scylladb/scylla:
test: raft: randomized_nemesis_test: include non-voters during reconfigurations
raft: server: if `add_entry` with `wait_type::applied` successfully returns, ensure `state_machine::apply` is called for this entry
raft: tracker: fix the definition of `voters()`
raft: when printing `raft::server_address`, include `can_vote`
The messages only dumps the last sealed fragment, but it should dump
all the output fragments replacing the exhausted input ones.
Let's print origin of output fragments, so we can differ between
files with compaction and garbage-collection origin.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220524232232.119520-1-raphaelsc@scylladb.com>
"
Azure snitch tries to replicate db/rack info from all shards to all
other shards. This may lead to use-after-free when shard A gets "this"
from shard B, starts copying its _dc field and the shard A destructs
its _dc from under B because it's receiving one from shard C.
Also polish replication code a little bit while at it.
"
* 'br-azure-snitch-serialize' of https://github.com/xemul/scylla:
snitch: Use invoke_on_others() to replicate
snitch: Merge set_my_dc and set_my_rack into one
azure_snitch: Do nothing on non-io-cpu
Restriction validation forbids lists (somewhat oddly, it talks about
indexes; validation should make a soft check about indexes (since it
can fall back to filtering) and a hard check about supported filtering
expressions), and enforces a map in another place. Remove the first
restriction and relax the second to allow lists as well as maps as
subscript operands.
Some validation messages are adjusted to reflect that lists are supported.
Infer the type of a list index as int32_type.
The error message when a non-subscriptable type is provided is
changed, so the corresponding test is changed too.
We already support subscripting maps (for filtering WHERE m[3] = 6),
so adding list subscript support is easy. Most of the code is shared.
Differences are:
- internal list representation is a vector of values, not of key/values
- key type is int32_type, not defined by map
- need to check index bounds
is not supported' from Nadav Har'El
This small series implements the DescribeTimeToLive and
DescribeContinuousBackups operations in Alternator. Even if the
corresponding features aren't implemented, it can help applications that
we implement just the Describe operation that can say that this feature
is in fact currently disabled.
Fixes #10660Closes#10670
* github.com:scylladb/scylla:
alternator: remove dead code
alternator: implement DescribeContinuousBackups operation
alternator: allow DescribeTimeToLive even without TTL enabled
This patch contains five tests which reproduce three old bugs in
Scylla's handling of multi-column restrictions like (c1,c2)<(1,2).
These old bugs are:
Refs #64 (yes, a two-digit issue!)
Refs #4244
Refs #6200
The three github issues are closely intertwined, exposing the same
or similar bugs in our internal implementation, and I suspect that
eventually most of them could be fixed together.
In writing these tests, I carefully read all three issues and the
various failure scenarios described in them, tried to distill and
simplify the scenarios, and also consider various other broken
variants of the scenarios. The resulting tests are heavily commented,
explaining the motivation of each test and exactly which of the
above bugs it reproduces.
All five tests included in this patch pass on Cassandra and currently
fail on Scylla, so are marked "xfail".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10675
It's an `int64_t` that needs to be explicitly initialized, otherwise the
value is undefined.
This is probably the cause of #10639, although I'm not sure - I couldn't
reproduce it (the bug is dependent on how the binary is compiled, so
that's probably it). We'll see if it reproduces with this fix, and if
it will, close the issue.
Closes#10681
"
The table keeps references on sstables_ and compaction_ managers
(among other things), but the latter sits as a pointer on table's
config while the former -- as on-table direct reference.
This set unifies both by turning sstables manager on-config pointer
into on-table reference.
branch: https://github.com/xemul/scylla/tree/br-table-vs-sstables-manager
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/574/
"
* 'br-table-vs-sstables-manager' of https://github.com/xemul/scylla:
tests: Remove sstables_manager& from column_family_test_config()
table: Move sstables_manager from config onto table itself
table, db, tests: Pass sstables_manager& into table constructor
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.
Fixes#10668.
Closes#10678
In issues #7944 and #10625 it was noticed that by assigning an empty
string to a non-string type (int, date, etc.) using INSERT or
INSERT JSON, some combinations of the above can create "empty" values
while they should produce a clear error.
The tests added in this patch explore the different combinations of
types and insert modes, and reproduce several buggy cases in Scylla
(resulting in xfail'ing tests) as well as Cassandra.
We feared that there might be a way using those buggy statements to
create a partition with an empty key - something which used to kill
older versions of Scylla. But the tests show that this is not possible -
while a user can use the buggy statements to create an empty value,
Scylla refuses it when it is used as a single-column partition key.
Refs #10625
Refs #7944
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10628
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.
In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:
scheduling_group
messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
if (must_compress) {
opts.compressor_factory = &compressor_factory;
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
- opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ // We send cookies only for non-default statement tenant clients.
+ if (idx > 3) {
+ opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ }
This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was change to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.
Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.
Fixes#9505.
Closes#10673
The purpose of this series is to introduce infrastructure
for managed scylla processes into test.py,
switch some existing suites to use test.py managed processes
and introduce cluster tests.
All of this is expected to make possible to test Raft topology
changes and schema changes using an easy to use and fast tool
such as test.py. In general this will allow testing Scylla clusters
from within the development test harness.
Branch URL: kostja/test.py.v5
Closes#10406
* github.com:scylladb/scylla:
test: disable topology/test_null
test.py: disable cdc_with_lwt_test it's flaky in debug mode
test.py: workaround for a python bug
test: cleanup (drop keyspace) in two cql tests to support --repeat
test.py: respect --verbose even if output is a tty
test: remove tools/cql_repl
test.py: switch cql/ suite to pytest/tabular output
test: remove a flaky test case
test.py: implement CQL approval tests over pytest
test.py: implement cql_repl
test.py: add topology suite
test.py: add common utility functions to test/pylib
test.py: switch cql-pytest and rest_api suites to PythonTestSuite
test.py: introduce PythonTest and PythonTestSuite
test.py: use artifact registry
test.py: temporarily disable raft
test.py: (pylib) add Scylla Server and Artifact Registry
test.py: (pylib) add Host Registry to track used server hosts
test.py: (pylib) add a pool of scylla servers (or clusters)
The manager reference is already available in constructor and thus
can be copied to on-table member.
The code that chooses the manager (user/system one) should be moved
from make_column_family_config() into add_column_family() method.
Once this happens, the get_sstables_manager() should be fixed to
return the reference from its new location. While at it -- mark the
method in question noexcept and add it's mutable overload.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In core code there's only one place that constructs table -- in
database.cc -- and this place currently has the sstables_manager pointer
sitting on table config (despite it's a pointer, it's always non-null).
All the tests always use the manager from one of _env's out there.
For now the new contructor arg is unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.
The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.
Previously it could happen that `add_entry` returned successfully but
`state_machine::apply` was never called by the server for this entry,
even though `wait_type::applied` was used, if the server loaded
a snapshot that contained this entry in just the right moment. Some
clients may find this behavior surprising, even though we may argue that
it's not technically incorrect.
For example, the nemesis test assumed that if `add_entry` returned
successfully (with `wait_type::applied`), the local state machine
applied the entry; the test uses `apply` to obtain an output - the
result of the command - from the state machine.
It's not a problem to give a stronger guarantee, so we do it in this
commit. In the scenario where a snapshot causes Raft to skip over the
entry, `add_entry` will finish exceptionally with
`commit_status_unknown`.
The previous implementation was weird, and it's not even clear if
the C++ standard defined what the result would be (because it used
`std::unordered_set::insert(iterator, iterator)`, where the iterators
pointed to a sequence of elements with elements that already had
equivalent elements in the set; cppreference does not specify which
elements end up in the set in this case).
In any case, in testing, the resulting set did not give the desired
result: if the configuration was joint, and a server was a voter in
the previous config but a non-voter in the current one, it would
not be a member of this set. This would cause the server to not vote for
itself when it became a candidate, which could lead to cluster
unavailability.
The new definition is simple: a server belongs to `voters()` iff it is
a voter in current or previous configuration. This fixes the problem
described above.
Fixes#10618.
This patch adds reproducing tests for wrong handling of LIMIT in a query
which uses a secondary index *and* filtering, described in issue #10649.
In that case, Scylla incorrectly limits the number of rows found in the
index *before* the filtering, while it should limit the number of rows
*after* the filtering.
The tests in this patch (which xfail on Scylla, and pass on Cassandra)
go beyond the minimum required to reproduce this bug. It turns out that
there are different sub-cases of this problem that go through different
code paths, namely whether the base table has clustering keys or just
partition keys, and whether the overall LIMITed result spans more than
one page. So these tests attempt to also cover all these sub-cases.
Without all these test sub-cases, an incomplete and incorrect fix of this
bug may, by chance, cause the original test to succeed.
Refs #10649
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#10658
Introduce a new operation, `raft_read`, which calls `read_barrier`
on a server, reads the state of the server's state machine, and returns
that state.
Extend the generator in `basic_generator_test` to generate `raft_read`s.
Only do it if forwarding is enabled (although it may make sense to test
read barriers in non-forwarding scenario as well - we may think about it
and do it in a follow-up).
Check the consistency of the read results by comparing them with the model
and using the result to extend the model with any newly observed elements.
The patchset includes some fixes for correctness (#10578)
and liveness (handling aborts correctly).
Closes#10561
* github.com:scylladb/scylla:
test: raft: randomized_nemesis_test: check consistency of reads
test: raft: randomized_nemesis_test: perform linearizable reads using read_barriers
test: raft: randomized_nemesis_test: add flags for disabling nemeses
raft: server: in `abort()`, abort read barriers before waiting for rpc abort
raft: server: handle aborts correctly in `read_barrier`
raft: fsm: don't advance commit index further than match_idx during read_quorum
Remove the function make_keyspace_name() that was never used.
We *could* have used this function, but we didn't, and it had
had an inconvenient API. If we later want to de-duplicate the
several copies of "executor::KEYSPACE_NAME_PREFIX + table_name"
we have in the code, we can do it with a better API.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although we don't yet support the DynamoDB API's backup features (see
issue #5063), we can already implement the DescribeContinuousBackups
operation. It should just say that continuous backups, and point-in-time
restores, and disabled.
This will be useful for client code which tries to inquire about
continuous backups, even if not planning to use them in practice
(e.g., see issue #10660).
Refs #5063
Refs #10660
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Scylla has a bug that only fires ones in a hundred runs in debug mode
when a schema change parallel to a topology change leads to a lost
keyspace and internal error. Disable the tests until Raft is enabled for
schema.
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!
This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.
The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.
After this patch, the following test now works on Scylla without
experimental flags turned on:
test/alternator/run test_ttl.py::test_describe_ttl_without_ttl
Refs #10660
Signed-off-by: Nadav Har'El <nyh@scylladb.com>