This PR introduces a new file data source implementation for uncompressed SSTables that will be validating the checksum of each chunk that is being read. Unlike for compressed SSTables, checksum validation for uncompressed SSTables will be active for scrub/validate reads but not for normal user reads to ensure we will not have any performance regression.
It consists of:
* A new file data source for uncompressed SSTables.
* Integration of checksums into SSTable's shareable components. The validation code loads the component on demand and manages its lifecycle with shared pointers.
* A new `integrity_check` flag to enable the new file data source for uncompressed SSTables. The flag is currently enabled only through the validation path, i.e., it does not affect normal user reads.
* New scrub tests for both compressed and uncompressed SSTables, as well as improvements in the existing ones.
* A change in JSON response of `scylla validate-checksums` to report if an uncompressed SSTable cannot be validated due to lack of checksums (no `CRC.db` in `TOC.txt`).
Refs #19058.
New feature, no backport is needed.
Closesscylladb/scylladb#20207
* github.com:scylladb/scylladb:
test: Add test to validate SSTables with no checksums
tools: Fix typo in help message of scylla validate-checksums
sstables: Allow validate_checksums() to report missing checksums
test: Add test for concurrent scrub/validate operations
test: Add scrub/validate tests for uncompressed SSTables
test/lib: Add option to create uncompressed random schemas
test: Add test for scrub/validate with file-level corruption
test: Check validation errors in scrub tests
sstables: Enable checksum validation for uncompressed SSTables
sstables: Expose integrity option via crawling mutation readers
sstables: Expose integrity option via data_consume_rows()
sstables: Add option for integrity check in data streams
sstables: Remove unused variable
sstables: Add checksum in the SSTable components
sstables: Introduce checksummed file data source implementation
sstables: Replace assert with on_internal_error
Fixes#20543
In cql_test_env, if cfg_in.ms_listen is set, we try to get a free port for the current test on
which message service rpc can bind. This to allow multiple tests in parallel.
However, we just do this by using random and getting a number, not actually verifying it against
host ports in use.
This is complicated further by the fact that port reuse is effectively disabled in seastar
(see reactor::posix_reuseport_detect()). Due to this, the solution applied here is a combo
of
* Create temp socket with port = 0 to get a previously free port
* Close socket right before listen (to handle reuse not working)
* Retry on EADDRINUSE
Closesscylladb/scylladb#20547
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.
The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.
New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.
The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.
When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.
Fixesscylladb/scylladb#15329
dtest https://github.com/scylladb/scylla-dtest/pull/4827Closesscylladb/scylladb#19745
* github.com:scylladb/scylladb:
view: test view_build_status table with node replace
test/pylib: use view_build_status_v2 table in wait_for_view
view_builder: common write view_build_status function
view_builder: improve migration to v2 with intermediate phase
view: delete node rows from view_build_status on node removal
view: sanitize view_build_status during migration
view: make old view_build_status table a virtual table
replica: move streaming_reader_lifecycle_policy to header file
view_builder: test view_build_status_v2
storage_service: add view_build_status to raft snapshot
view_builder: migration to v2
db:system_keyspace: add view_builder_version to scylla_local
view_builder: read view status from v2 table
view_builder: introduce writing status mutations via raft
view_builder: pass group0_client and qp to view_builder
view_builder: extract sys_dist status operations to functions
db:system_keyspace: add view_build_status_v2 table
In a previous patch we extended the return status of
`sstables::validate_checksums()` to report if an SSTable cannot be
validated due to a missing CRC component (i.e., CRC.db does not appear
in TOC.txt).
Add a test case for this.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Change the return type of `sstable::validate_checksums()` from binary
(valid/invalid) to a ternary (valid/invalid/no_checksums). The third
status represents uncompressed SSTables without a CRC component (no
entry for CRC.db in the TOC).
Also, change the JSON response of `sstable validate-checksums` to expose
the new status. Replace the boolean value for valid/invalid checksums
with an object that contains two boolean keys: one that indicates if the
SSTable has checksums, and one that indicates if the checksums are valid
or not. The second key is optional and appears only if the SSTable has
checksums.
Finally, update the documentation to reflect the changes in the API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Theoretically it is possible to launch more than one scrub instances
simultaneously. Since the checksum component is a shared resource,
accesses have to be synchronized.
Add a test that launches two scrub operations in validate mode and
ensures that the checksum component is loaded once, referenced by all
scrub instances via shared pointers, and deleted once the scrub
operations finish. Introduce an injection point to achieve concurrent
execution of scrubs.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently the unit tests check scrub in validate mode against compressed
SSTables only. Mirror the tests for uncompressed SSTables as well.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `random_schema_specification` to support creating both
compressed and uncompressed schemas.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Currently, we test scrub/validate only against a corrupted SSTable with
content-level corruption (out-of-order partition key).
Add a test for file-level corruption as well. This should trigger the
checksum check in the underlying compressed file data source
implementation.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Scrub was extended in PR #11074 to report validation errors but the
unit tests were not updated.
Update the tests to check the validation errors reported by scrub.
Validation errors must be zero for valid SSTables and non-zero for
invalid SSTables.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Requires a backport to 6.0 and 6.1 as they have the same issue.
Closesscylladb/scylladb#20471
* github.com:scylladb/scylladb:
cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
database::get_all_tables_flushed_at: fix return value
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
typo: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
Closesscylladb/scylladb#19472
* github.com:scylladb/scylladb:
commitlog/database: Make some commitlog options updatable + add feature listener
features/config: Add feature for fragmented commitlog entries
docs: Add entry on commitlog file format v4
commitlog_test: Add more oversized cases
commitlog_replayer: Replay segments in order created
commitlog_replayer: Use replay state to support fragmented entries
commitlog_replayer: coroutinize partly
commitlog: Handle oversized entries
The test-env in question is mostly started in one-shard mode. Also there
are several boost tests that start sharded<> environment. In that case
instances on different shards live in different temp dirs. That's not
critical yet, but better to have single directory for the whole test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20412
In test.py every asyncio task spawned during the test must be finished before the next test, otherwise, tests might affect each other results.
The developers are responsible for writing asyncio code in a way that doesn’t leave task objects unfinished.
Test.py has a mechanism that helps test writers avoid such tasks. At the end of each test case, it verifies that the test did not produce/leave any tasks and sets an event object that fails the next test at the start if this is the case(issue https://github.com/scylladb/scylladb/issues/16472)
The problem with this was that breaking the next test was counterintuitive, and the logging for this situation was insufficient and unobvious.
notes: Task.cancel() is not an option to avoid task leakage
1) Calling cancel() Does Not Cancel The Task : the cancel() method just request that the target task cancel.
2) Calling cancel() Does Not Block Until The Task is Cancelled: If the caller needs to know the task is cancelled and done, it could await for the target
3) In particular PR, task.cancel() cancell task on client(ManagerClient) but not on http server(ScyllaManager). so "await" is needed.
Closesscylladb/scylladb#20012
The test in question generates a bunch of table_for_tests objects and
creates sstables for each. For that it calls test_env::make_sstable(),
but it can be made shorter, by calling table method directly.
The hidden goal of this change is to remove the explicit caller of
table::dir() method. The latter is going away.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20451
This includes
- coroutinization
- elimination of unused overload
Closesscylladb/scylladb#20456
* github.com:scylladb/scylladb:
test: Squash two open_sstables() helper together
test: Coroutinize open_sstables() helper
This series prepares us for working on #11567 - allow adding a GSI to a pre-existing table. This will require changing the implementation of GSIs in Alternator to not use real columns in the schema for the materialized view, and instead of a computed column - a function which extracts the desired member from the `:attrs` map and de-serializes it.
This series does not contain the GSI re-implementation itself. Rather it contains a few small cleanups and mostly - new regression tests that cover this area, of adding and removing a GSI, and **using** a GSI, in more details than the tests we already had. I developed most of these tests while working on **buggy** fixes for #11567; The bugs in those implementations were exposed by the tests added here - they exposed bugs both in the new feature of adding or removing a GSI, and also regressions to the ordinary operation of GSI. So these tests should be helpful for whoever ends up fixing #11567, be it me based on my buggy implementation (which is _not_ included in this patch series), or someone else.
No backports needed - this is part of a new feature, which we don't usually backport.
Closesscylladb/scylladb#20383
* github.com:scylladb/scylladb:
test/alternator: more extensive tests for GSI with two new key attributes
test/alternator: test invalid key types for GSI
test/alternator: test combination of LSI and GSI
test/alternator: expand another test to use different write operations
test/alternator: test GSIs with different key types
alternator: better error message in some cases of key type mismatch
test/alternator: test for more elaborate GSI updates
test/alternator: strengthen tests for empty attribute values
test/alternator: fix typo in test_batch.py
test/alternator: more checks for GSI-key attribute validation
Alternator: drop unneeded "IS NOT NULL" clauses in MV of GSI/LSI
test/alternator: add more checks for adding/deleting a GSI
test/alternator: ensure table deletions in test_gsi.py
There are some test cases in sstable_directory_test test actually create
a table with CQL and then try to manipulate its sstables with the help
of sstable_directory. Those tests use existing local helper that starts
sharded<sstable_directory> and this helper passes test-local static
schema to sstable_directory constructor. As a result -- the schema of a
table that test case created and the schema that sstable_directory works
with are different. They match in the columns layout, which helps the
test cases pass, but otherwise are two different schema objects with
different IDs. It's more correct to use table schema for those runs.
The fix introduces another helper to start sharded<sstable_directory>,
and the older wrapper around cql_test_env becomes unused. Drop it too
not to encourage future tests use it and re-introduce schema mismatch
again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20499
Any expired tombstone can be garbage collected if it doesn't shadow data in the commit log, memtable, or uncompacting SSTables.
This PR introduces a new mode to major compaction, enabled by the `consider_only_existing_data` flag that bypasses these checks. When enabled, memtables and old commitlog segments are cleared with a system-wide flush and all the sstables (after flush) are included in the compaction, so that it works with all data generated up to a given time point.
This new mode works with the assumption that newly written data will not be shadowed by expired tombstones. So it ignores new sstables (and new data written to memtable) created after compaction started. Since there was a system wide flush, commitlog checks can also be skipped when garbage collecting tombstones. Introducing data shadowed by a tombstone during compaction can lead to undefined behavior, even without this PR, as the tombstone may or may not have already been garbage collected.
Fixes#19728Closesscylladb/scylladb#20031
* github.com:scylladb/scylladb:
cql-pytest: add test to verify consider_only_existing_data compaction option
tools/scylla-nodetool: add consider-only-existing-data option to compact command
api: compaction: add `consider_only_existing_data` option
compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp
compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled
tombstone_gc_state: introduce with_commitlog_check_disabled()
compaction: introduce new option to check only compacting sstables for gc
compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
compaction: maybe_flush_all_tables: add new force_flush param
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.
Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
Should be backported to 6.0 and 6.1; the dtest started failing due to topology on raft, which sped up execution of the test and exposed the preexisting problem.
Closesscylladb/scylladb#20488
* github.com:scylladb/scylladb:
test: topology_custom/test_hints: consistency test for decommission
test: topology_custom/test_hints: move sync point helpers to top level
test: topology/util: extract find_server_by_host_id
hints: send hints with CL=ALL if target is leaving
hints: inline do_send_one_mutation
In sstable directory test there are two of those -- one that works on
path, state, env and callback, and the other one that just needs env and
callback, getting path from env and assuming state is normal.
Two test cases in this test can enjoy the shorter one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20395
c70f321c6f added an extra check if KS
exists. This check can throw `data_dictionary::no_such_keyspace`
exception, which is supposed to be caught and a more user-friendly
exception should be thrown instead.
This commit fixes the above problem and adds a testcase to validate it
doesn't appear ever again.
Also, I moved the check for the keyspace outside of the `for` loop, as
it doesn't need to be checked repeatedly.
Fixes: scylladb/scylladb#20097Closesscylladb/scylladb#20404
The case of a GSI with two key attributes (hash and range) which were both
not keys in the base table is a special case, not supported by CQL but
allowed in Alternator. We have several tests for this case, but they don't
cover all the strange possibilities that a GSI row disappears / reappears
when one or two of the attributes is updated / inserted / deleted.
So this patch includes a more extensive test for this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that types which are not allowed for GSI keys -
basically any type except S(tring), B(ytes) or N(number), are rejected
as expected - an error path that we didn't cover in existing tests.
The new test passes - Alternator doesn't have a bug in this area, and
as usual, also passes on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To allow adding a GSI to an existing table (refs #11567), we plan to
re-implement GSIs to stop forcing their key attribute to become a real
column in the schema - and let it remains a member of the map ":attrs"
like all non-key attributes. But since LSIs can only be defined on table
creation time, we don't have to change the LSI implementation, and these
can still force their key to become a real column.
What the test in this patch does is to verify that using the same
attribute as a key of *both* GSI and LSI on the same table works.
There's a high risk that it won't work: After all, the LSI should force the
attribute to become a real column (to which base reads and writes go), but
the GSI will use a computed column which reads from ":attrs", no? Well,
it turns out that view.cc's value_getter::operator() always had a
surprising exception which "rescues" this test and makes it pass: Before
using a computed column, this code checks if a base-table column with the
same name exists, and if it does, it is used instead of the computed column!
It's not clear why this logic was chosen, but it turns out to be really
useful for making the test in this test pass. And it's important that if
we ever change that unintuitive behavior, we will have this test as a
regression test.
The new test unsurprisingly passes on current Scylla because its
implementation of GSI and LSI is still the same. But it's an important
regression test for when we change the GSI implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Expand another Alternator test (test_gsi.py::test_gsi_missing_attribute)
to write items not just using PutItem, but also using UpdateItem and
BatchWriteItem. There is a risk that these different operations use
slightly different code paths - so better check all of them and not
just PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All of the tests in test/alternator/test_gsi.py use strings as the GSI's
keys. This tests a lot of GSI functionality, but we implicitly assumed that
our implementation used an already-correct and already-tested implementation
of key columns and MV, which if it works for one type, works for other types
as well.
This assumption will no longer hold if we reimplement GSI on a "computed
column" implementation, which might run different code for different types
of GSI key attributes (the supported types are "S"tring, "B"ytes, and
"N"umber).
So in this patch we add tests for writing and reading different types of
GSI key attributes. These tests showed their importance as regression
tests when the first draft of the GSI reimplementation series failed them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Alternator uses a common function get_typed_value() to read the values
of key attribute and confirm they have the expected type (key attributes
have a fixed type in the schema). If the type is wrong, we want to print
a "Type mismatch" error message.
But the current implementation did the checks in the wrong order, and
as a result could print a "Malformed value object" message instead of a
"Type mismatch". That could happen if the wrong type is a boolean, map,
list, or basically any type whose JSON representation is not a string.
The allowed key types - bytes), string and number - all have string
representations in JSON, but still we should first report the mismatched
type and only report the "Malformed object" if the type matches but the
JSON is faulty.
In addition to fixing the error message, we fix an existing test which
complained in a comment (but ignored) that the error message in some
case (when trying to use a map where a key is expected) the strange
"Malformed value object" instead of the expected "Type mismatch".
The next patch will add an additional reproducer for this problem and
its fix. That test will do:
```
with pytest.raises(ClientError, match='ValidationException.*mismatch'):
test_table_gsi_6.put_item(Item={'p': p, 's': True})
```
I.e., it tries to set a boolean value for a string key column, and
expect to get the "Type mismatch" error and not the ugly "Malformed
value object".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most tests in test_gsi.py involve simple updates to a GSI, just
creating a GSI row. Although a couple of tests did involve more
complex operations (such as an update requiring deleting an old row
from the GSI and inserting a new one,), we did not have a single
organized test designed to check all these cases, so we add one in
this patch.
This test (test_update_gsi_pk) will be important for verifying
the low-level implementation of the new GSI implementation that
we plan to based on computed columns. Early versions of that code
passed many of the simpler tests, but not this one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We soon plan to refactor Alternator's GSI and change the validation of
values set in attributes which are GSI keys. It's important to test that
when updating attributes that are *not* GSI keys - and are either base-
table keys or normal non-key attributes - the validation didn't change.
For example, empty strings are still not allowed in base-table key
attributes, but are allowed (since May 2020 in DynamoDB) in non-key
attributes.
We did have tests in this area, but this patch strengthens them -
adding a test for non-key attribute, and expanding the key-attribute
test to cover the UpdateItem and BatchWriteItem operations, not just
PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To enhance the test reports UX:
1. switching off/on passed/failed/skipped test for better visibility
2. better searching in test results
3. understanding the trends of execution for each test
4. better configurability of the final report
Enable allure adapter for all python tests.
Add tags and parameters to the test to be able to distinguish them across modes and runs.
Related: https://github.com/scylladb/qa-tasks/issues/1665
Related: https://github.com/scylladb/scylladb/pull/19335
Related: https://github.com/scylladb/scylladb/pull/18169Closesscylladb/scylladb#19942
* github.com:scylladb/scylladb:
[test.py] Clean duplicated arg for test suite
[test.py] Enable allure for python test
Two tests had a typo 'item' instead of 'Item'. If Scylla had a bug, this
could have caused these tests to miss the bug.
Scylla passes also the fixed test, because Scylla's behavior is correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When an attribute is a GSI key, DynamoDB imposes certain rules when
writing values for it - it must be of the declared type for that key,
and can't be an empty string. We had tests for this, but all of them
did the write using the PutItem operation.
In this patch we also test the same things using the UpdateItem and
BatchWriteItem operations. Because Scylla has different code paths
for these three operations, and each code path needs to remember to
call the validation function, all three should all be checked and not just
PutItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have tests for the feature of adding or removing a GSI from
an existing table, which Alternator doesn't yet support (issue #11567).
In this patch we add another check, how after a GSI is added, you can
no longer add items with the wrong type for the indexed type, and after
removing a GSI, you can. The expanded tests pass on DynamoDB, and
obviously still xfail on Alternator because the feature is not yet
implemented.
Refs #11567.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.
We missed a few such cases in test_gsi.py, and fixed some of them in
commit 59c1498338 but still missed a few,
and this patch fixes some more instances of this problem.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.
There are no functional changes in this patch - most of the lines
changed are just reindents.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Adds the test_hints_consistency_during_decommission test which
reproduces the failure observed in scylladb/scylla-dtest#4582. It uses
error injections, including the newly added
topology_coordinator_pause_after_streaming injection, to reliably
orchestrate the scenario observed there.
In a nutshell, the test makes sure to replay hints after streaming
during decommission has finished, but before the cluster switches to
reading from new replicas. Without the fix, hints would be replayed to
the decommissioned node and then would be lost forever after the cluster
start reading from new replicas.
Move create_sync_point and await_sync_point from the scope of the
test_sync_point test to the file scope. They will be used in a test that
will be introduced in the commit that follows.
The idea of the test is to have a cluster where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the raft state machine.
To achieve this following two lists introduced in the PR:
- ERROR_INJECTIONS in error_injections.py
- CLUSTER_EVENTS in cluster_events.py
Each cluster event is an async generator which has 2 yields and should be used in the following way:
0. Start the generator:
```python
>>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
```
1. Run the prepare part (before the first yield)
```python
>>> await anext(cluster_event_steps)
```
2. Run the cluster event itself (between the yields)
```python
>>> await anext(cluster_event_steps)
```
3. Run the check part (after the second yield)
```python
>>> await anext(cluster_event, None)
```
Closesscylladb/scylladb#16223
* github.com:scylladb/scylladb:
test: randomized failure injection for Raft-based topology
test: error injections for Raft-based topology
[test.py] topology.util: add get_non_coordinator_host() function
[test.py] random_tables: add UDT methods
[test.py] random_tables: add CDC methods
[test.py] api: get scylla process status
[test.py] api: add expected_server_up_state argument to server_add()
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).
Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.
Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.
Fixes: scylladb/scylladb#20111Closesscylladb/scylladb#20320
The idea of the test is to have a small cluster, where one node
is stressed with injections and failures and the rest of
the cluster is used to make progress of the Raft state machine.
To achieve this following two lists introduced in the commit:
- ERROR_INJECTIONS in error_injections.py
- CLUSTER_EVENTS in cluster_events.py
Each cluster event is an async generator which has 2 yields and should be
used in the following way:
0. Start the generator:
>>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
1. Run the prepare part (before the first yield)
>>> await anext(cluster_event_steps)
2. Run the cluster event itself (between the yields)
>>> await anext(cluster_event_steps)
3. Run the check part (after the second yield)
>>> await anext(cluster_event, None)
Add get_non_coordinator_host() function which returns
ServerInfo for the first host which is not a coordinator
or None if there is no such host.
Also rework get_coordinator_host() to not fail if some
of the hosts don't have a host id.
Allow to return from server_add() when a server reaches specified state.
One of:
- PROCESS_STARTED
- HOST_ID_QUERIED (previously called NOT_CONNECTED)
- CQL_CONNECTED (renamed from CONNECTED)
- CQL_QUERIED (was just QUERIED)
Also, rename CqlUpState to ServerUpState and move to internal_types.
When creating sstables this test allocates temporary local options.
That works, because this test doesn't run on object storage, but it's
more correct to pick storage options from the table at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20440