Tests in test_tombstone_gc.py are parametrized with string instead
of bool values. Fix that. Use the value to create a keyspace with
or without tablets.
Fixes: #18888.
Closesscylladb/scylladb#18893
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.
Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.
The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.
The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
Fixes: #17786Fixes: #18709Closesscylladb/scylladb#18816
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the test changes RF from 2 to 3, the extra node executes "rebuild"
transition which means that it streams tablets replicas from two other
peers. When doing it, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks if the extra
node received the mutations notices two mutation fragments on the new
replica and errorneously fails by seeing, that RF=3 is not equal to the
number of mutations found, which is 4.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
This PR ensures that CDC keeps working correctly in the recovery
mode after leaving the raft-based topology.
We update `system.cdc_local` in `topology_state_load` to ensure
a node restarting in the recovery mode sees the last CDC generation
created by the topology coordinator.
Additionally, we extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator.
Fixesscylladb/scylladb#17409Fixesscylladb/scylladb#17819Closesscylladb/scylladb#18820
* github.com:scylladb/scylladb:
test: test_topology_recovery_basic: test CDC during recovery
test: util: start_writes_to_cdc_table: add FIXME to increase CL
test: util: start_writes_to_cdc_table: allow restarting with new cql
storage_service: update system.cdc_local in topology_state_load
This patch fixes two bugs in `test_topology_ops`:
1. The values of `tablets_enabled` were nonempty strings, so they
always evaluated to `True` in the if statement responsible for
enabling writing workers only if tablets are disabled. Hence, the
writing workers were always disabled.
2. The `topology_experimental_raft suite` uses tablets by default,
so we need a config with empty `experimental_features` to disable
them.
Ensuring this test works with and without tablets is considered
a part of 6.0, so we should backport this patch.
Closesscylladb/scylladb#18900
If the user wants to change the default initial tablets value, it uses ALTER KEYSPACE statement. However, specifying `WITH tablets = { initial: $value }` will take no effect, because statement analyzer only applies `tablets` parameters together with the `replication` ones, so the working statement should be `WITH replication = $old_parameters AND tablets = ...` which is not very convenient.
This PR changes the analyzer so that altering `tablets` happens independently from `replication`. Test included.
fixes: #18801Closesscylladb/scylladb#18899
* github.com:scylladb/scylladb:
cql-pytest: Add validation of ALTER KEYSPACE WITH TABLETS
cql3: Fix parsing of ALTER KEYSPACE's tablets parameters
cql3: Remove unused ks_prop_defs/prepare_options() argument
before this change, we rely on `seastar/util/std-compat.hh` to
include the used headers provided by stdandard library. this was
necessary before we moved to a C++20 compliant standard library
implementation. but since Seastar has dropped C++17 support. its
`seastar/util/std-compat.hh` is not responsible for providing these
headers anymore.
so, in this change, we include the used headers directly instead
of relying on `seastar/util/std-compat.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18883
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.
Fixes#18085.
Closesscylladb/scylladb#18287
* github.com:scylladb/scylladb:
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
There's a test that checks how ALTER changes the initial tablets value,
but it equips the statement with `replication` parameters because of
limitations that parser used to impose. Now the `tablets` parameters can
come on their own, so add a new test. The old one is kept from
compatibility considerations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`test_topology_ops` is flaky, which has been uncovered by gating
in scylladb/scylladb#18707. However, debugging it is harder than it
should be because write workers can flood the logs. They may send
a lot of failed writes before the test fails. Then, the log file
can become huge, even up to 20 GB.
Fix this issue by stopping a write worker after the first error.
This test is important for 6.0, so we can backport this change.
Closesscylladb/scylladb#18851
Raft service levels are read-only in recovery mode. This patch adds check and proper error message when a user tries to modify service levels in recovery mode.
Fixes https://github.com/scylladb/scylladb/issues/18827Closesscylladb/scylladb#18841
* github.com:scylladb/scylladb:
test/auth_cluster/test_raft_service_levels: try to create sl in recovery
service/qos/raft_sl_dda: reject changes to service levels in recovery mode
service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.
In topology on raft, management of CDC generations is moved to the
topology coordinator. We extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator. A node restarting in the recovery mode
should learn about the active generation from `system.cdc_local`
(or from gossip, but we don't want to rely on it). Then, it should
load its data from `system.cdc_generations_v3`.
Fixesscylladb/scylladb#17409
This patch allows us to restart writing (to the same table with
CDC enabled) with a new CQL session. It is useful when we want to
continue writing after closing the first CQL session, which
happens during the `reconnect_driver` call. We must stop writing
before calling `reconnect_driver`. If a write started just before
the first CQL session was closed, it would time out on the client.
We rename `finish_and_verify` - `stop_and_verify` is a better
name after introducing `restart`.
The system-distributed-keyspace and view-update-generator often go in pair, because streaming, repair and sstables-loader (via distributed-loader) need them booth to check if sstable is staging and register it if it's such. The check is performed by messing directly with system_distributed.view_build_status table, and the registration happens via view-update-generator.
That's not nice, other services shouldn't know that view status is kept in system table. Also view-update-generator is a service to generae and push view updates, the fact that it keeps staging sstables list is the implementation detail.
This PR replaces dependencies on the mentioned pair of services with the single dependency on view-builder (repair, sstables-loader and stream-manager are enlightened) and hides the view building-vs-staging details inside the view_builder.
Along the way, some simplification of repair_writer_impl class is done.
Closesscylladb/scylladb#18706
* github.com:scylladb/scylladb:
stream_manager: Remove system_distributed_keyspace and view_update_generator
repair: Remove system_distributed_keyspace and view_update_generator
streaming: Remove system_distributed_keyspace and view_update_generator
sstables_loader: Remove system_distributed_keyspace and view_update_generator
distributed_loader: Remove system_distributed_keyspace and view_update_generator
view: Make register_staging_sstable() a method of view_builder
view: Make check_view_build_ongoing() helper a method of view_builder
streaming: Proparage view_builder& down to make_streaming_consumer()
repair: Keep view_builder& on repair_writer_impl
distributed_loader: Propagate view_builder& via process_upload_dir()
stream_manager: Add view builder dependency
repair_service: Add view builder dependency
sstables_loader: Add view_bulder dependency
main: Start sstables loader later
repair: Remove unwanted local references from repair_meta
Database used to be (and still is in many ways) an object used to get configuration from. Part of the configuration is the set of pre-configured scheduling groups. That's not nice, services should use each other for some real need, not as proxies to configuration. This patch patches the places that explicitly switch to statement group _not_ to use database to get the group itself.
fixes: #17643Closesscylladb/scylladb#18799
* github.com:scylladb/scylladb:
database: Don't export statement scheduling group
test: Use async attrs and cql-test-env scheduling groups
test: Use get_scheduling_groups() to get scheduling groups
api: Don't switch sched group to start/stop protocol servers
main: Don't switch sched group to start protocol servers
code: Switch to sched group in request_stop_server()
code: Switch to server sched group in start()
protocol_server: Keep scheduling group on board
code: Add scheduling group to controllers
redis: Coroutinize start() method
in 4c1b6f04, we added a concept for fmt::is_formattable<>. but it
was not ncessary. the fmt::is_formattable<> trait was enough. the
reason 4c1b6f04 was actually a leftover of a bigger change which
tried to add trait for the cases where fmt::is_formattable<> was
not able to cover. but that was based on the wrong impression that
fmt::is_formattable<> should be able to work with container types
without including, for instance `fmt/ranges.h`. but in 222dbf2c,
we include `fmt/ranges.h` in tests, where the range-alike formatter
is used, that enables `fmt::is_formattable<>` to tell that container
types are formattable.
in short, 4c1b6f04 was created based on a misunderstanding, and
it was a reduced type trait, which is proved to be not necessary.
so, in this change, it is dropped. but the type constraints is
preserved to make the build failure more explicit, if the fallback
formatter does not match with the type to be formatted by Boost.test.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18879
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769
this series disables operator<<:s for vector and unordered_map, and drop operator<< for mutation, because we don't have to keep it to work with these operator:s anymore. this change is a follow up of https://github.com/scylladb/seastar/issues/1544
this change is a cleanup. so no need to backport
Closesscylladb/scylladb#18866
* github.com:scylladb/scylladb:
mutation,db: drop operator<< for mutation and seed_provider_type&
build: disable operator<< for vector and unordered_map
db/heat_load_balance: include used header
test: define a more generic boost_test_print_type
test/boost: define fmt::formatter for service_level_controller_test.cc
test/boost: include test/lib/test_utils.hh
When selecting from mutation_fragments(table) one may want to apply
token() filtering againts partition key. This doesn't work currently,
but used to crash. This patch adds a regression test for that
refs: #18637
refs: #18768
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18759
fmt::is_formattable<T>::value is false, even if
* T is a container of U, and
* fmt::is_formattable<U>, and
* U can be formatted using fmt::formatter
so, we have to define a more generic boost_test_print_type()
for the all types supported by {fmt}. it will help us to ditch the
operator<< for vector and unordered_map in Seastar, and allow us
to use the fmt::formatter specialization of the element
types.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we are moving away for operator<< based formatter, more and more
types now only have {fmt} based formatters. the same will apply to the
STL container types after ditching the generic homebrew formatter in
to_string.hh, so to be prepared for the change, let's add the
fmt::formatter for tests as well.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change was created in the same spirit of 505900f18f. because
we are deprecating the operator<< for vector and unorderd_map in
Seastar, some tests do not compile anymore if we disable these
operators. so to be prepared for the change disabling them, let's
include test/lib/test_utils.hh for accessing the printer dedicated
for Boost.test. and also '#include <fmt/ranges.h>' when necessary,
because, in order to format the ranges using {fmt}, we need to
use fmt/ranges.h.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The modified lines of code intend to await the first appearance of a log
on one of the nodes.
But due to misplaced parentheses, instead of creating a list of log-awaiting
tasks with a list comprehension, they pass a generator expression to
asyncio.create_task().
This is nonsense, and it fails immediately with a type error.
But since they don't actually check the result of the await,
the test just assumes that the search completed successfully.
This was uncovered by an upgrade to Python 3.12, because its typing is stronger
and asyncio.create_task() screams when it's passed a regular generator.
This patch fixes the bad list comprehension, and also adds an error check
on the completed awaitables (by calling `await` on them).
Fixes#18740Closesscylladb/scylladb#18754
Continuation of the prevuous patch, but with its own flavor. There's a
manual test that wants to run seastar thread in statement scheduling
group and gets one from database. This patch makes it get the group from
cql-test-env and, while at it, makes it switch to that group using
thread attributes passed to async() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's such a helper in cql-test-env that other tests use to get sched
groups from. Few other tests (ab)use databse for that, this patch fixes
those remnants.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One source of flakiness is in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode
due to gossiper being aborted prematurely, and causing reconnection
storm.
Another is test_tablet_missing_data_repair which is flaky due an issue
in python driver that session might not reconnect on rolling restart
(tracked by https://github.com/scylladb/python-driver/issues/230)
Refs #15356.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.
It's worked by feeding local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).
Fixes#18085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As part of the Alternator test suite, we check Alternator's support for
authentication. Alternator maps Scylla's existing CQL roles to AWS's
authentication:
* AWS's access_key_id <- the name of the CQL role
* AWS's secret_access_key <- the salted hash of the password of the CQL role
Before this patch, the Alternator test suite created a new role with a
preset salted hash (role "alternator", salted hash "secret_pass")
and than used that in the tests. However, with the advent of Raft-based
metadata it is wrong to write directly to the roles table, and starting
with #17952 such writes will be outright forbidden.
But we don't actually need to create a new CQL role! We already have
a perfectly good CQL role called "cassandra", and our tests already use
it. So what this patch does is to have the Alternator tests (conftest.py)
read from the roles system-table the salted hash of the "cassandra" role,
and then use that - instead of the hard-coded pair alternator/secret_pass -
in the tests.
A couple more tests assumed that the role name that was used was
"alternator", but now it was changed to "cassandra" so those tests
needed minor fixes as well.
After this patch, the Alternator tests no longer *write* to the roles
system table. Moreover, after this patch, test/alternator/run and
test/alternator/suite.yaml (used when testing with test.py) no longer
need to do extra ugly CQL setup before starting the Alternator tests.
Fixes#18744
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18771
There's a test that checks system.tablets contents to see that after
changing ks replication factor via ALTER KEYSPACE the tablet map is
updated properly. This patch extends this test that also validates that
mutations themselves are replicated according to the desired replication
factor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18644
The sl-controller is stopped in three steps. The first (and instantly the second) is unsubscribing from lifecycle notification and draining. The third is stop itself. First two steps are "out of order" as compared to the desired start-stop sequence of any service, this patch fixes these steps.
After this PR the drain_on_shutdown() (the call that drains the node upon stop) finally becomes clean and tidy and is no longer accompanied by ad-hoc fellow drains/stops/aborts/whatever.
refs: #2737Closesscylladb/scylladb#18731
* github.com:scylladb/scylladb:
sl_controller: Remove drain() method
sl_controller: Move abort kicking into do_abort()
main,sl_controller: Subscribe for early abort
main: Unsubscribe sl controller next to subscribing
C++ standard does not define the order in which the parameters
passed to a function are evaluated. so in theory, in
```c++
reusable_sst(sst->get_schema(), std::move(sst));
```
`std::move(sst)` could be evaluated before `sst->get_schema`.
but please note, `std::move(sst)` does not move `sst`
away, it merely cast `sst` to a rvalue reference, it is
`reusable_sst()` which *could* move `sst` away by
consuming it. so following call is much more dangerous
than the above one:
```c++
reusable_sst(sst->get_schema(), modify_sst(std::move(sst)))
```
nevertheless, this usage is still confusing. so instead
of passing a copy of `sst` to `reusable_sst`.
this change is inspired by clang-tidy, it warns like:
```
Warning: /home/runner/work/scylladb/scylladb/test/lib/test_services.cc:397:25: warning: 'sst' used after it was moved [bugprone-use-after-move]
397 | return reusable_sst(sst->get_schema(), std::move(sst));
| ^
/home/runner/work/scylladb/scylladb/test/lib/test_services.cc:397:44: note: move occurred here
397 | return reusable_sst(sst->get_schema(), std::move(sst));
| ^
/home/runner/work/scylladb/scylladb/test/lib/test_services.cc:397:25: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
397 | return reusable_sst(sst->get_schema(), std::move(sst));
|
```
per the analysis above, this is a false alarm.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18775
There's stop-signal in main that fires an abort source on stop. Lots of
other services are subscribed in it, add the sl-controller too. For now
it's a no-op, but next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
User-defined types can depend on each other, creating directed acyclic graph.
In order to support restoring schema from `DESC SCHEMA`, UDTs should be
ordered topologically, not alphabetically as it was till now.
This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema.
Fixes#18539Closesscylladb/scylladb#18302
* github.com:scylladb/scylladb:
test/cql-pytest/test_describe: add test for UDTs ordering
cql3/statements/describe_statement: UDTs topological sorting
cql3/statements/describe_statement: allow to skip alphabetical sorting
types: add a method to get all referenced user types
db/cql_type_parser: use generic topological sorting
db/cql_type_parses: futurize raw_builder::build()
test/boost: add test for topological sorting
utils: introduce generic topological sorting algorithm
PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group.
Fixes#18675Closesscylladb/scylladb#18721
* github.com:scylladb/scylladb:
sstables_manager: use maintenance scheduling group to run components reload fiber
sstables_manager: add member to store maintenance scheduling group
This PR resolves issue with double count of the test result for topology tests. It will not appear in the consolidated report anymore.
Another fix is to provide a better view which test failed by modifying the test case name in the report enriching it with mode and run id, so making them unique across the run.
The scope of this change is:
1. Modify the test name to have run id in name
2. Add handlers to get logs of test.py and pytest in one file that are related to test, rather than to the full suite
3. Remove topology tests from aggregating them on a suite level in Junit results
4. Add a link to the logs related to the failed tests in Junit results, so it will be easier to navigate to all logs related to test
5. Gather logs related to the failed test to one directory for better logs investigation
Ref: scylladb/scylladb#17851Closesscylladb/scylladb#18277