Add docstrings to server_add(), server_start(), and servers_add() explaining
that they wait for ServerUpState.SERVING before returning, which means Scylla
has finished listening on all configured ports (including non-default ones).
Note that server_add() and server_start() accept expected_server_up_state to
return earlier if needed, while servers_add() always waits for SERVING.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
server_start() had the same bug that led to changing the server_add()
default to SERVING (see previous commit): after restarting a node that
listens on non-default ports, the polling of the hardcoded CQL/Alternator
ports could succeed before the custom ports were ready, causing
intermittent failures.
Apply the same fix to server_start() in manager_client.py,
ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED,
which polls the standard CQL and Alternator ports to determine when the
server is ready. This is wrong when a test configures Scylla to listen
on non-default ports: the polling succeeds on the default ports while the
custom ports may not yet be ready, making such tests intermittently flaky.
The correct behavior is ServerUpState.SERVING, which waits for Scylla's
sd_notify("READY=1") signal. This signal is sent only after all
configured listeners — including custom ports — are fully open, so it
is the right readiness signal regardless of the port configuration.
Up to now, the fix for each affected test was to pass
expected_server_up_state=ServerUpState.SERVING explicitly once the
flakiness was noticed (e.g. #29737). Change the default so that all
future tests get the correct behavior automatically.
Changed in manager_client.server_add(), ScyllaCluster.add_server(), and
the _cluster_server_add HTTP handler. The multi-server servers_add() path
already inherits the new default through add_server().
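
As a rough illustration of why the default matters, the sketch below models
the startup states as an ordered enum and waits until the observed state
reaches the expected one. The member list, its ordering, and the
wait_for_state() helper are assumptions made for this example; they are not
the actual pylib code.

```python
import enum
from typing import Iterable


class ServerUpState(enum.IntEnum):
    # Illustrative members and ordering only (an assumption); the real
    # enum in pylib may define different or additional states.
    PROCESS_STARTED = 1
    HOST_ID_QUERIED = 2
    CQL_ALTERNATOR_QUERIED = 3
    SERVING = 4


def wait_for_state(observed: Iterable[ServerUpState],
                   expected: ServerUpState = ServerUpState.SERVING) -> ServerUpState:
    """Hypothetical helper: return the first observed state >= expected."""
    for state in observed:
        if state >= expected:
            return state
    raise TimeoutError(f"server never reached {expected.name}")


startup = [ServerUpState.PROCESS_STARTED,
           ServerUpState.HOST_ID_QUERIED,
           ServerUpState.CQL_ALTERNATOR_QUERIED,
           ServerUpState.SERVING]

# With SERVING as the default, the caller waits for the final state.
default_result = wait_for_state(startup)
# An explicitly requested earlier state still returns sooner.
early_result = wait_for_state(startup, ServerUpState.CQL_ALTERNATOR_QUERIED)
```

A test that only needs the standard ports could still opt into the earlier
state explicitly, which is the behavior the commit keeps available.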
Fixes SCYLLADB-1822
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused with Alternator tests.
Extend the ServerUpState enum and startup loop to also check Alternator
port readiness when configured. Whenever one or more Alternator ports are
configured, each is checked to be connectable and queryable, similar to
how CQL ports are probed.
Fixes SCYLLADB-1701
Closes scylladb/scylladb#29625
Calling cluster.shutdown() directly is an incorrect way to shut down a
Cassandra Cluster. The correct way is to use safe_driver_shutdown.
Fixes SCYLLADB-1434
Closes scylladb/scylladb#29390
Accept both list[str] (from distinct_errors=True) and
list[list[str]] (from distinct_errors=False) in filter_errors(),
matching against the first line of each error group. This allows
tests that call grep_for_errors() with default arguments to
pipe results directly through filter_errors().
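
A minimal sketch of the idea, assuming a hypothetical signature in which
filter_errors() takes the errors and a list of regex patterns to ignore (the
real parameters may differ). Each entry is matched on its first line, whether
it is a plain string or a group of lines.

```python
import re
from typing import List, Sequence, Union

ErrorEntry = Union[str, List[str]]


def first_line(entry: ErrorEntry) -> str:
    """Return the line used for matching: the string itself or the group's first line."""
    if isinstance(entry, str):
        return entry
    return entry[0] if entry else ""


def filter_errors(errors: Sequence[ErrorEntry],
                  ignore_patterns: Sequence[str]) -> List[ErrorEntry]:
    """Hypothetical signature: drop entries whose first line matches any pattern."""
    compiled = [re.compile(p) for p in ignore_patterns]
    return [e for e in errors
            if not any(p.search(first_line(e)) for p in compiled)]


# Shape produced with distinct_errors=True: list[str]
flat = ["ERROR disk full", "ERROR timeout talking to peer"]
# Shape produced with distinct_errors=False: list[list[str]]
grouped = [["ERROR disk full", "  at frame 1"],
           ["ERROR timeout talking to peer"]]

flat_left = filter_errors(flat, [r"timeout"])
grouped_left = filter_errors(grouped, [r"timeout"])
```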
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:
RuntimeError: cannot schedule new futures after shutdown
Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().
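
The fix can be sketched as follows. FakeCluster and ManagerSketch are
stand-ins invented for this example (the real code uses the driver's Cluster
and ManagerClient); only the tracking pattern is the point.

```python
from typing import List


class FakeCluster:
    """Stand-in for the driver's Cluster object (an assumption for this sketch)."""

    def __init__(self) -> None:
        self.is_shutdown = False

    def shutdown(self) -> None:
        self.is_shutdown = True


class ManagerSketch:
    def __init__(self) -> None:
        # Every exclusive Cluster is recorded so it can be closed later.
        self.exclusive_clusters: List[FakeCluster] = []

    def get_cql_exclusive(self) -> FakeCluster:
        cluster = FakeCluster()
        self.exclusive_clusters.append(cluster)
        return cluster

    def driver_close(self) -> None:
        # Shut down all tracked clusters, then forget them.
        for cluster in self.exclusive_clusters:
            cluster.shutdown()
        self.exclusive_clusters.clear()


manager = ManagerSketch()
first = manager.get_cql_exclusive()
second = manager.get_cql_exclusive()
manager.driver_close()
```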
Refs SCYLLADB-573
This patch allows ManagerClient.get_cql_exclusive to accept an AuthProvider
as a parameter. This will be used in a follow-up patch that migrates the
audit test suite to test/cluster and requires this functionality for
some tests.
Refs SCYLLADB-573
There is a race condition in the driver that raises a RuntimeException.
It pollutes the output, so this PR just silences the exception.
Fixes: SCYLLADB-900
Closes scylladb/scylladb#28957
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.
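
The bug class is easy to reproduce in isolation. In the sketch below, grep()
is a simplified stand-in for ScyllaLogFile.grep() (its real signature is not
shown here); a coroutine object that was never awaited is always truthy, so
the check passes even when nothing matched.

```python
import asyncio
from typing import List


async def grep(lines: List[str], needle: str) -> List[str]:
    """Simplified stand-in for an async log grep."""
    return [line for line in lines if needle in line]


async def main():
    log = ["INFO started", "INFO serving"]

    pending = grep(log, "ERROR")   # missing await: this is a coroutine object
    buggy = bool(pending)          # truthy regardless of the log contents
    pending.close()                # avoid the "never awaited" warning

    fixed = bool(await grep(log, "ERROR"))  # empty list of matches: falsy
    return buggy, fixed


result = asyncio.run(main())
```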
Fixes: SCYLLADB-903
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.
This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).
This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.
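
A minimal sketch of the capping logic, assuming an ordered enum whose members
are illustrative only. resolve_expected_state() is a hypothetical helper name;
the real code performs the equivalent step inside ManagerClient.

```python
import enum


class ServerUpState(enum.IntEnum):
    # Illustrative members and ordering (an assumption for this sketch).
    PROCESS_STARTED = 1
    HOST_ID_QUERIED = 2
    CQL_QUERIED = 3
    SERVING = 4


def resolve_expected_state(expected: ServerUpState,
                           connect_driver: bool) -> ServerUpState:
    """Cap the expected state at HOST_ID_QUERIED when no driver is connected."""
    if not connect_driver:
        return min(expected, ServerUpState.HOST_ID_QUERIED)
    return expected


capped = resolve_expected_state(ServerUpState.SERVING, connect_driver=False)
already_lower = resolve_expected_state(ServerUpState.PROCESS_STARTED,
                                       connect_driver=False)
untouched = resolve_expected_state(ServerUpState.SERVING, connect_driver=True)
```

Resolving the value in one place means the lower layer never has to know
about connect_driver at all.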
Refs SCYLLADB-409
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
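
The two derived flags described above can be sketched like this. NodeState
and its fields are invented for the example and do not mirror the actual
topology data structures; the sketch only restates the rules in code.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical per-node view used only for this sketch."""
    has_leave_or_remove_request: bool
    tablet_replicas: int


def is_draining(node: NodeState) -> bool:
    # A node is draining as soon as a leave or remove request exists for it.
    return node.has_leave_or_remove_request


def is_paused(node: NodeState) -> bool:
    # The request stays paused while tablet replicas remain on the node.
    return node.has_leave_or_remove_request and node.tablet_replicas > 0


still_has_tablets = NodeState(has_leave_or_remove_request=True, tablet_replicas=3)
drained = NodeState(has_leave_or_remove_request=True, tablet_replicas=0)
idle = NodeState(has_leave_or_remove_request=False, tablet_replicas=5)
```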
When the request is not paused, its execution starts in the
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
Fixes #21452
Closes scylladb/scylladb#24129
* https://github.com/scylladb/scylladb:
docs: Document parallel decommission and removenode and relevant task API
test: Add tests for parallel decommission/removenode
test: util: Introduce ensure_group0_leader_on()
test: tablets: Check that there are no migrations scheduled on draining nodes
test: lib: topology_builder: Introduce add_draining_request()
topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
tablets: topology_coordinator: Refactor to propagate reason for migration rollback
tablet_allocator: Skip co-location on draining nodes
node_ops: task_manager_module: Populate entity field also for active requests
tasks: node_ops: Put node id in the entity field
tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
topology: Protect against empty cancelation reason
tasks, topology: Make pending node operations abortable
doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
raft_topology, tablets: Drain tablets in parallel with other topology operations
virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
locator: topology: Add "draining" flag to a node
topology_coordinator: Extract generate_cancel_request_update()
storage_service: Drop dependency in topology_state_machine.hh in the header
locator: Extract common code in assert_rf_rack_valid_keyspace()
topology_coordinator, storage_service: Validate node removal/decommission at request submission time
Many tests want to assume that the group0 leader runs on a particular
server, typically the first server in the list.
They cannot easily be made to work with an arbitrary leader, because
they set up a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open the leader's log and
expect things to appear in that log.
It's much easier to ensure the leader than to prepare tests to
handle failovers.
It's a global operation, so we can use any server.
It's not only convenient. The call via api.disable_tablet_balancing()
confuses people into thinking that it's a per-server operation. This leads
to a proliferation of code which does it needlessly on all servers.
test_tablets.test_orphaned_sstables_on_startup verifies that an
on_internal_error("Unable to load SSTable...") is generated when
an sstable outside a tablet boundary is found on startup.
The test indeed finds the error, but then proceeds to hang in
find_backtraces(), or fail if find_backtraces() is fixed, since
it finds an unexpected (for it) crash.
Fix this by not looking for crashes if a new option expected_crash
is set. Set it for this test.
After tests end, an extra check is performed, looking into node logs.
By default, it only searches for critical errors and scans for coredumps.
If the test has the fixture `check_nodes_for_errors`, it will search for all errors.
Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`.
If any of the above are found, the test will fail with an error.
Adding pid info to servers allows matching coredumps with servers.
Other improvements:
- When replacing just some fields of ServerInfo, use `_replace` instead of
building a new object. This way it is agnostic to changes to the object.
- When building ServerInfo from a list, the types defined for its fields are
not enforced, so ServerInfo(*list) works fine and does not need to be changed if
fields are added or removed.
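
Both points can be seen with a small NamedTuple. The field list below is
hypothetical and chosen only for the example; the real ServerInfo may have
different fields.

```python
from typing import NamedTuple


class ServerInfo(NamedTuple):
    # Hypothetical field list for this sketch.
    server_id: int
    ip_addr: str
    pid: int


info = ServerInfo(server_id=1, ip_addr="127.0.0.1", pid=100)

# _replace copies the tuple and changes only the named field, so this line
# keeps working if more fields are added to ServerInfo later.
restarted = info._replace(pid=200)

# Field annotations are not enforced at runtime, so a list of raw values
# (here the id arrives as a string) can be unpacked directly.
from_list = ServerInfo(*["2", "127.0.0.2", 300])
```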
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.
In this commit we fix the problem by adding a typed
wrapper around universalasync.wrap.
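
The general technique is a generic pass-through signature. In the sketch
below, untyped_wrap() is a stand-in for universalasync.wrap (it simply
returns its argument), and typed_wrap() is a hypothetical name; the actual
wrapper in the tree may look different.

```python
from typing import Any, TypeVar

T = TypeVar("T")


def untyped_wrap(obj: Any) -> Any:
    # Stand-in for universalasync.wrap (an assumption for this sketch):
    # the Any return annotation hides the type of the wrapped object.
    return obj


def typed_wrap(obj: T) -> T:
    """Same runtime behavior, but type checkers see the input type preserved."""
    return untyped_wrap(obj)


class Client:
    def ping(self) -> str:
        return "pong"


# Type checkers can infer Client here instead of Any, so navigation and
# completion on client.ping() keep working.
client = typed_wrap(Client())
```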
Fixes: scylladb/scylladb#26639
Some tests need the ability to abruptly stop a server in the test
cluster before it fully booted - e.g., because the test knows (and
perhaps even expects) that the boot is hung. But before this patch,
manager.server_stop() could only kill servers in "running" state.
This patch adds to pylib tracking of "starting" servers - servers which
we are starting but which haven't finished booting. Their list is
returned by manager.starting_servers(). The manager.server_stop()
function can now kill a server which is just starting - not just
"running" servers.
To avoid breaking existing tests, manager.all_servers() continues to
return just running and stopped servers - not "starting" servers.
By the way, when a starting server is killed, it is not listed as stopped -
it just behaves as a normal failure to add the server, and not as a
server which successfully joined the cluster but was later stopped.
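
The bookkeeping described above can be sketched as follows. ClusterSketch and
its begin_start()/finish_start() methods are invented for this example; only
the behavior of starting_servers(), all_servers() and server_stop() follows
the description.

```python
from typing import Dict, List


class ClusterSketch:
    def __init__(self) -> None:
        self.starting: Dict[int, str] = {}
        self.running: Dict[int, str] = {}
        self.stopped: Dict[int, str] = {}

    def begin_start(self, server_id: int) -> None:
        self.starting[server_id] = f"server-{server_id}"

    def finish_start(self, server_id: int) -> None:
        self.running[server_id] = self.starting.pop(server_id)

    def server_stop(self, server_id: int) -> None:
        if server_id in self.starting:
            # Killed before boot finished: treated as a failed add,
            # so it is not recorded as stopped.
            del self.starting[server_id]
        elif server_id in self.running:
            self.stopped[server_id] = self.running.pop(server_id)
        else:
            raise KeyError(server_id)

    def starting_servers(self) -> List[int]:
        return sorted(self.starting)

    def all_servers(self) -> List[int]:
        # Unchanged contract: running and stopped servers only.
        return sorted([*self.running, *self.stopped])


cluster = ClusterSketch()
cluster.begin_start(1)
cluster.finish_start(1)
cluster.begin_start(2)          # imagine this boot hangs
before = (cluster.all_servers(), cluster.starting_servers())
cluster.server_stop(2)          # kill the server that is still starting
after = (cluster.all_servers(), cluster.starting_servers())
```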
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, there is no simple way to remove an option from the server's
config file in tests. One example when this is needed is removing the
`recovery_leader` option on all servers during the recovery procedure.
In this commit, we add a new method to `ManagerClient` that removes
an option from the given server's config file.
decrease request timeout.
In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change the request timeout for tests from 180s to 90s, since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout (2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
Copy `auth_test.py` from the scylla-dtest test suite, remove all non-next_gating tests from it, and make it work with `test.py`.
As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker.
Remove `test_permissions_caching` test because it's too flaky when running using test.py
Also, make a few execution-time optimizations:
- remove redundant `time.sleep(10)`
- use smaller timeouts for CQL sessions
Enable the test in `suite.yaml` (run in dev mode only.)
Additional modifications to test.py/dtest shim code:
- Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair.
- Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`.
- Copy generate_cluster_topology() function from tools/cluster_topology.py module.
- Add support for `bootstrap` parameter for `new_node()` function.
- Rework `wait_for_any_log()` function.
Closes scylladb/scylladb#24648
* github.com:scylladb/scylladb:
test.py: dtest: make auth_test.py run using test.py
test.py: dtest: rework wait_for_any_log()
test.py: dtest: add support for bootstrap parameter for new_node
test.py: dtest: add generate_cluster_topology() function
test.py: dtest: add ScyllaNode.set_configuration_options() method
test.py: pylib/manager_client: support batch config changes
test.py: dtest: copy unmodified auth_test.py
test.py: dtest: add missed markers to pytest.ini
Multiple tests are currently flaky due to graceful shutdown
timing out when flushing tables takes more than a minute. We still
don't understand why flushing is sometimes so slow, but we suspect
it is an issue with new machines spider9 and spider11 that CI runs
on. All observed failures happened on these machines, and most of
them on spider9.
In this commit, we increase the timeout of graceful shutdown as
a temporary workaround to improve CI stability. When we get to
the bottom of the issue and fix it, we will revert this change.
Ref #12028
It's a temporary workaround to improve CI stability, we don't
have to backport it.
Closes scylladb/scylladb#24802
When reading tablet replicas from system.tablets, we need to refer to
the base table partition, if any.
We fix and simplify the test api for reading tablet replicas to read
from the base table.
Modify the ManagerClient.server_update_config() method to change
multiple config options in one call, in addition to one `key: value`
pair. All internal machinery was converted to take a dict of values as a
parameter. Type hints were adjusted too.
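
One way to support both call styles is to normalize the arguments into a
single dict up front. The helper name and parameter names below are
assumptions for this sketch and may not match the actual method signature.

```python
from typing import Any, Dict, Optional


def normalize_config_update(key: Optional[str] = None,
                            value: Any = None,
                            config_options: Optional[Dict[str, Any]] = None
                            ) -> Dict[str, Any]:
    """Hypothetical helper: turn either call style into one dict of values."""
    if config_options is not None:
        if key is not None or value is not None:
            raise ValueError("pass either key/value or config_options, not both")
        return dict(config_options)
    if key is None:
        raise ValueError("nothing to update")
    return {key: value}


single = normalize_config_update("authenticator", "PasswordAuthenticator")
batch = normalize_config_update(config_options={
    "authenticator": "PasswordAuthenticator",
    "authorizer": "CassandraAuthorizer",
})
```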
Add a run ID to the process output file name so that it is not overwritten in the following case: the first run fails and the second passes. Both runs use the same file name, so the second run overwrites and deletes the file. This helps investigation when a C++ test fails.
Attach Scylla log files to the allure report when a test fails. This is an alternative to the link in the JUnit report that exists in CI. The change helps investigate cluster test failures. An example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).
Backport is not needed; these are only framework enhancements.
Closes scylladb/scylladb#24677
* github.com:scylladb/scylladb:
test.py: Attach node logs in allure report in case of fail
test.py: Add run id to the boost output file
Waiting for CQL requires the default superuser to be present
in the database. In some cases we may delete it and still want to do
a rolling restart. Additionally, if we need CQL we may want to
wait after the restart is complete (once, and not for each node).
Use universalasync library to make test.py async code compatible
with synchronous code of dtest/ccm
Also, copied unmodified error_example_test.py from dtest as an example.
Run the test in `dev` mode only.
Add parameters to the server_start() method to make it possible to
change Scylla's CLI and env options on node start.
Also, add the `expected_server_up_state` parameter, as we have for
the server_add() method.
This commit adds to ManagerClient a get_cql_exclusive function that
allows creating a CQL connection with WhiteListRoundRobinPolicy for
a single server. Such a connection is useful in tests that kill nodes, to
make sure that the live node handles the queries. Before this commit,
some tests used cluster_con from test/cluster/conftest.py; after
this commit, tests can start using a method from ManagerClient.
This change:
- Extend ManagerClient con_gen type to allow LoadBalancingPolicy arg
- Implement get_cql_exclusive()
So that a multi-dc/multi-rack cluster can be populated
in a single call.
* Enhancement, no backport required
Closes scylladb/scylladb#23341
* github.com:scylladb/scylladb:
test/pylib: servers_add: add auto_rack_dc parameter
test/pylib: servers_add: support list of property_files
Run ScyllaClusterManager using a pytest fixture if the `--manager-api`
option is not provided.
At this stage we're trying to stay as close to test.py as possible.
test.py runs tests file-by-file, so, effectively, the scopes `session`,
`package`, and `module` are pretty much the same. Also, test.py starts
ScyllaClusterManager for every test module, which is why the
`manager_api_sock_path` fixture has scope=`module`. As a result, we
need to change the scope of the `manager_internal` fixture too.
This commit introduces the connect_driver argument in
ManagerClient::server_add. The argument allows skipping the CQL driver
initialization during server start. Starting a server without
the driver is necessary to implement some test scenarios related
to system initialization.
After stopping a server, ManagerClient::server_start can be used to
start the server again, so the connect_driver argument is also added there
to prevent connecting the driver after a server restart.
This change:
- Implement connect_driver argument in ManagerClient::server_add
- Implement connect_driver argument in ManagerClient::server_start
The test only passes a subset of the running servers to the rolling
restart. The rolling restart checks the visibility of the restarted
node against the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.
Improve the manager client rolling restart method to consider all the
running nodes when checking the restarted node's visibility.
Fixes: scylladb/scylladb#19959
Closes scylladb/scylladb#21477
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.
Fixes: #20240
Closes scylladb/scylladb#21007