Commit Graph

693 Commits

Author SHA1 Message Date
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Asias He
ad5275fd4c test.py: Add tests for tablet incremental repair
The following tests are added for tablet incremental repair:

- Basic incremental repair

- Basic incremental repair with error

- Minor compaction and incremental repair

- Major compaction and incremental repair

- Scrub compaction and incremental repair

- Cleanup/Upgrade compaction and incremental repair

- Tablet split and incremental repair

- Tablet merge and incremental repair
2025-08-18 11:01:21 +08:00
Evgeniy Naydanov
e44b26b809 test.py: pytest: support --mode/--repeat in a common way for all tests
Implement repetition of files using pytest_collect_file hook: run
file collection as many times as needed to cover all --mode/--repeat
combinations.  Also move disabled test logic to this hook.

Store build mode and run_id in pytest item stashes.

Simplify support of C++ tests: remove redundant facade abstraction and put
all code into 3 files: base.py, boost.py, and unit.py

Add support for `run_first` option in test_config.yaml
2025-08-17 15:26:23 +00:00
Evgeniy Naydanov
bffb6f3d01 test.py: pytest: streamline suite configuration handling
Move test_config.yaml handling code from common_cpp_conftest.py to
TestSuiteConfig class in test/pylib/runner.py
2025-08-17 12:32:36 +00:00
Evgeniy Naydanov
a2a59b18a3 test.py: refactor: remove unused imports in test.py
Also use the constant for "suite.yaml" string.
2025-08-17 12:32:36 +00:00
Evgeniy Naydanov
a188523448 test.py: fix run with bare pytest after merge of scylladb/scylladb#24573
To run tests with bare pytest command we need to have almost the
same set of options as test.py because we reuse code from test.py.

scylladb/scylladb#24573 added `--pytest-arg` option to test.py but
not to test/conftest.py which breaks running Python tests using
bare pytest command.
2025-08-17 12:32:35 +00:00
Evgeniy Naydanov
600d05471b test.py: refactor: move framework-related code to test.pylib.runner
Some test suites have own test runners based on pytest, and they
don't need all stuff we use for test.py.  Move all code related to
test.py framework to test/pylib/runner.py and use it as a plugin
conditionally (by using TEST_RUNNER variable.)
2025-08-17 12:32:35 +00:00
Evgeniy Naydanov
f2619d2bb0 test.py: resource_gather: add cwd parameter to run_process()
Also done sort arguments in Popen call to match the signature.
2025-08-17 12:32:35 +00:00
Benny Halevy
f22a870a04 test: cluster: test_repair: add test_vnode_keyspace_describe_ring
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:39:40 +03:00
Michael Litvak
5c28cffdb4 test/pylib/rest_client: fix ScyllaMetrics filtering
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.

However, it would return the value of the first matching line it finds
instead of summing all matching lines.

For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2

The result of this call would be 1 instead of 3:
get('some_metric', shard="0")

We fix this to sum all matching lines.

The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for the line to match the filter, either the
filter needs to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its value, which is
unexpected, and breaks when more labels are added.

We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.
2025-08-10 10:16:00 +02:00
Botond Dénes
70aa81990b Merge 'Alternator - add the ability to write, not just read, system tables' from Nadav Har'El
In commit 44a1daf we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible

This series add supports for writing to system tables through Alternator, and examples of tests using this capability (and utility functions to make it easy).

Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write"

No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low.

Fixes #12348

Closes scylladb/scylladb#19147

* github.com:scylladb/scylladb:
  test/alternator: utility functions for changing configuration
  alternator: add optional support for writing to system table
  test/alternator: reduce duplicated code
2025-08-08 09:13:15 +03:00
Nadav Har'El
d632599a92 Merge 'test.py: native pytest repeats' from Andrei Chekun
Previous way of execution repeat was to launch pytest for each repeat.
That was resource consuming, since each time pytest was doing discovery
of the tests. Now all repeats are done inside one pytest process.

Backport for 2025.3 is needed, since this functionality is framework only, and 2025.3 affected with this slow repeats as well.

Closes scylladb/scylladb#25073

* github.com:scylladb/scylladb:
  test.py: add repeats in pytest
  test.py: add directories and filename to the log files
  test.py: rename log sink file for boost tests
  test.py: better error handling in boost facade
2025-08-06 18:18:03 +03:00
Nadav Har'El
a896e2dbb9 alternator: add optional support for writing to system table
In commit 44a1daf we added the ability to read system tables through
the DynamoDB API (actually, the Scan and Query requests only).
This ability is useful for tests, and can also be useful to users who
want to read information that is only available through system tables.

This patch adds support also for *writing* into system tables. This will
be useful for Alternator tests, were we want to temporarily change
some live-updatable configuration option - and so far haven't been
able to do that like we did do in some cql-pytest tests.

For reasons explained in issue #23218, only superuser roles are allowed to
write to system tables - it is not enough for the role to be granted
MODIFY permissions on the system table or on ALL KEYSPACES. Moreover,
the ability to modify system tables carries special risks, so this
patch only allows writes to the system tables if a new configuration
option "alternator_allow_system_table_write" turned on. This option is
turned off by default.

This patch also includes a test for this new configuration-writing
capability. The test scripts test/alternator/run and test.py now
run Scylla with alternator_allow_system_table_write turned on, but
the new test can also run without this option, and will be skipped
in that case (to allow running the test suite against some manually-
run instance of Scylla).

Fixes: #12348

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-06 10:00:04 +03:00
Andrei Chekun
c0d652a973 test.py: change boost test stdout to use filehandler instead of pipe
With current implementation if pytest will be killed, it will not be
able to write the stdout from the boost test. With a new way it should
be updated while test executing, instead of writing it the end of the
test.

Closes scylladb/scylladb#25260
2025-08-01 15:05:00 +03:00
Taras Veretilnyk
1d6808aec4 topology_coordinator: Make tablet_load_stats_refresh_interval configurable
This commits introduces an config option 'tablet_load_stats_refresh_interval_in_seconds'
that allows overriding the default value without using error injection.

Fixes scylladb/scylladb#24641

Closes scylladb/scylladb#24746
2025-07-31 14:31:55 +03:00
Petr Gusev
3500a10197 scylla_cluster.py: add try_get_host_id
Tests sometimes fail in ScyllaCluster.add_server on the
'replaced_srv.host_id' line because host_id is not resolved yet. In
this commit we introduce functions try_get_host_id and get_host_id
that resolve it when needed.

Closes scylladb/scylladb#25177
2025-07-31 10:37:06 +02:00
Patryk Jędrzejczak
c41f0e6da9 Merge 'generic server: 2 step shutdown' from Sergey Zolotukhin
This PR implements solution proposed in scylladb/scylladb#24481

Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.

The updated shutdown process is as follows:

1. Initial Shutdown Phase
   * Close the accept gate to block new incoming connections.
   * Abort all accept() calls.
   * For all active connections:
      * Close only the input side of the connection to prevent new requests.
      * Keep the output side open to allow responses to be sent.

2. Drain Phase
   * Wait for all in-progress requests to either complete or fail.

3. Final Shutdown Phase
   * Fully close all connections.

Fixes scylladb/scylladb#24481

Closes scylladb/scylladb#24499

* https://github.com/scylladb/scylladb:
  test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`,  decrease request timeout.
  generic_server: Two-step connection shutdown.
  transport: consmetic change, remove extra blanks.
  transport: Handle sleep aborted exception in sleep_until_timeout_passes
  generic_server: replace empty destructor with `= default`
  generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
  generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
  test: Add test for query execution during CQL server shutdown
2025-07-31 10:32:30 +02:00
Andrei Chekun
d0e4045103 test.py: add repeats in pytest
Previous way of executin repeat was to launch pytest for each repeat.
That was resource consuming, since each time pytest was doing discovery
of the tests. Now all repeats are done inside one pytest process.
2025-07-30 12:03:08 +02:00
Andrei Chekun
853bdec3ec test.py: add directories and filename to the log files
Currently, only test function name used for output and log files. For better
clarity adding the relative path from the test directory of the file name
without extension to these files.
Before:
test_aggregate_avg.1.log
test_aggregate_avg_stdout.1.log
After:
boost.aggregate_fcts_test.test_aggregate_avg.1.log
boost.aggregate_fcts_test.test_aggregate_avg_stdout.3.log
2025-07-30 12:03:08 +02:00
Andrei Chekun
557293995b test.py: rename log sink file for boost tests
Log sink is outputted in XML format not just simple text file. Renaming to have better clarity
2025-07-30 12:03:08 +02:00
Andrei Chekun
cc75197efd test.py: better error handling in boost facade
If test was not executed for some reason, for example not known parameter passed to the test, but boost framework was able to finish correctly, log file will have data but it will be parsed to an empty list. This will raise an exception in pytest execution, rather than produce test output. This change will handle this situation.
2025-07-30 12:03:08 +02:00
Sergey Zolotukhin
4f63e1df58 test: Set request_timeout_on_shutdown_in_seconds to request_timeout_in_ms,
decrease request timeout.

In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
2025-07-29 15:37:47 +02:00
Asias He
e28c75aa79 repair: Avoid too many fragments in a single repair_row_on_wire
When repairing a partition with many rows, we can store many fragments
in a repair_row_on_wire object which is sent as a rpc stream message.

This could cause reactor stalls when the rpc stream compression is
turned on, because the compression compresses the whole message without
any split and compression.

This patch solves the problem at the higher level by reducing the
message size that is sent to the rpc stream.

Tests are added to make sure the message split works.

Fixes #24808
2025-07-29 13:43:53 +08:00
Dawid Mędrek
b41151ff1a test: Enable RF-rack-valid keyspaces in all Python suites
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.

That encompasses the following suites:

* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.

Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.

The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.

The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.

Fixes scylladb/scylladb#25126

Closes scylladb/scylladb#24617
2025-07-28 16:32:59 +02:00
Tomasz Grabiec
a1d7722c6d Merge 'api: repair_async: refuse repairing tablet keyspaces' from Aleksandra Martyniuk
A tablet repair started with /storage_service/repair_async/ API
bypasses tablet repair scheduler and repairs only the tablets
that are owned by the requested node. Due to that, to safely repair
the whole keyspace, we need to first disable tablet migrations
and then start repair on all nodes.

With the new API - /storage_service/tablets/repair -
tailored to tablet repair requirements, we do not need additional
preparation before repair. We may request it on one node in
a cluster only and, thanks to tablet repair scheduler,
a whole keyspace will be safely repaired.

Both nodetool and Scylla Manager have already started using
the new API to repair tablets.

Refuse repairing tablet keyspaces with /storage_service/repair_async -
403 Forbidden is returned. repair_async should still be used to repair
vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/23008.

Breaking change; no backport.

Closes scylladb/scylladb#24678

* github.com:scylladb/scylladb:
  repair: remove unused code
  api: repair_async: forbid repairing tablet keyspaces
2025-07-27 09:25:42 +02:00
Botond Dénes
837424f7bb Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis
This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping").

In more detail, this patch series consists of:
* Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI).
* The Azure host - the Key Vault endpoint bridge.
* The Azure Key Provider - the interface for the Azure host.
* Unit tests using real Azure resources (credentials and Vault keys).
* Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens).

This is part of the overall effort to support Azure deployments.

Testing done:
* Unit tests.
* Manual test on an Azure VM with a Managed Identity.
* Manual test with credentials from Azure CLI.
* Manual test of `--azure-hosts` cmdline option.
* Manual test of log filtering.

Remaining items:
- [x] Create necessary Azure resources for CI.
- [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201).

Closes https://github.com/scylladb/scylla-enterprise/issues/1077.

New feature. No backport is needed.

Closes scylladb/scylladb#23920

* github.com:scylladb/scylladb:
  docs: Document the Azure Key Provider
  test: Add tests for Azure Key Provider
  pylib: Add mock server for Azure Key Vault
  encryption: Define and enable Azure Key Provider
  encryption: azure: Delegate hosts to shard 0
  encryption: Add Azure host cache
  encryption: Add config options for Azure hosts
  encryption: azure: Add override options
  encryption: azure: Add retries for transient errors
  encryption: azure: Implement init()
  encryption: azure: Implement get_key_by_id()
  encryption: azure: Add id-based key cache
  encryption: azure: Implement get_or_create_key()
  encryption: azure: Add credentials in Azure host
  encryption: azure: Add attribute-based key cache
  encryption: azure: Add skeleton for Azure host
  encryption: Templatize get_{kmip,kms,gcp}_host()
  encryption: gcp: Fix typo in docstring
  utils: azure: Get access token with default credentials
  utils: azure: Get access token from Azure CLI
  utils: azure: Get access token from IMDS
  utils: azure: Get access token with SP certificate
  utils: azure: Get access token with SP secret
  utils: rest: Add interface for request/response redaction logic
  utils: azure: Declare all Azure credential types
  utils: azure: Define interface for Azure credentials
  utils: Introduce base64url_{encode,decode}
2025-07-25 10:45:32 +03:00
Aleksandra Martyniuk
a0031ad05e api: repair_async: forbid repairing tablet keyspaces
Return 403 Forbidden if a user tries to repair tablet keyspace with
/storage_service/repair_async/ API.
2025-07-24 11:11:09 +02:00
Tomasz Grabiec
c9bf010d6d Merge 'test.py: skip cleaning testlog' from Andrei Chekun
Skip removing any artifacts when -s provided between test.py invocation.
Logs from the previous run will be overridden if tests were executed one
more time. Fox example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if tests are passed
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run

Backport is not needed, since it framework enhancement.

Closes scylladb/scylladb#24838

* github.com:scylladb/scylladb:
  test.py: skip cleaning artifacts when -s provided
  test.py: move deleting directory to prepare_dir
2025-07-24 09:46:42 +03:00
Avi Kivity
e4c4141d97 test.py: don't crash on early cleanup of ScyllaServer
If a test fails very early (still have to find why), test.py
crashes while flushing a non-existent log_file, as shown below.

To fix, initialize the property to None and check it during
cleanup.

```
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------

'ScyllaServer' object has no attribute 'log_file'
test_cluster_features Traceback (most recent call last):
  File "/home/avi/scylla-maint/./test.py", line 816, in <module>
    sys.exit(asyncio.run(main()))
             ~~~~~~~~~~~^^^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/home/avi/scylla-maint/./test.py", line 523, in main
    total_tests_pytest, failed_pytest_tests = await run_all_tests(signaled, options)
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/./test.py", line 452, in run_all_tests
    failed += await reap(done, pending, signaled)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/./test.py", line 418, in reap
    result = coro.result()
  File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 143, in run
    return await super().run(test, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/test/pylib/suite/base.py", line 216, in run
    await test.run(options)
  File "/home/avi/scylla-maint/test/pylib/suite/topology.py", line 48, in run
    async with get_cluster_manager(self.uname, self.suite.clusters, str(self.suite.log_dir)) as manager:
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/contextlib.py", line 221, in __aexit__
    await anext(self.gen)
  File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 2006, in get_cluster_manager
    await manager.stop()
  File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 1539, in stop
    await self.clusters.put(self.cluster, is_dirty=True)
  File "/home/avi/scylla-maint/test/pylib/pool.py", line 104, in put
    await self.destroy(obj)
  File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 65, in recycle_cluster
    srv.log_file.close()
    ^^^^^^^^^^^^
AttributeError: 'ScyllaServer' object has no attribute 'log_file'
```

Closes scylladb/scylladb#24885
2025-07-22 12:39:01 +02:00
Nikos Dragazis
083aabe0c6 pylib: Add mock server for Azure Key Vault
The Azure Key Provider depends on three Azure services:

- Azure Key Vault
- IMDS
- Entra STS

To enable local testing, introduce a mock server that offers all the
needed APIs from these services. The server also offers an error
injection endpoint to configure a particular service to respond with
some error code for a number of consecutive requests.

The server is integrated as a 3rd party service in test.py.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:09 +03:00
Patryk Jędrzejczak
a654101c40 Merge 'test.py: add missed parameters that should be passed from test.py to pytest' from Andrei Chekun
Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling these parameters: `--random-seed` and `--x-log2-compaction-groups`

Since this code affected with this issue in 2025.3 and this is only framework change, backport for that version needed.

Fixes: https://github.com/scylladb/scylladb/issues/24927

Closes scylladb/scylladb#24928

* https://github.com/scylladb/scylladb:
  test.py: add bypassing x_log2_compaction_groups to boost tests
  test.py: add bypassing random seed to boost tests
2025-07-16 15:29:17 +02:00
Andrei Chekun
a8fd38b92b test.py: skip discovery when combined_test binary absent
To discover what tests are included into combined_tests, pytest check this at
the very beginning. In the case if combined_tests binary is missing, it will
fail discovery and will not run test, even when it was not included into
combined_tests. This PR changes behavior, so it will not fail when
combined_tests is missing and only fail in case someone tries to run test from
it.

Closes scylladb/scylladb#24761
2025-07-15 09:49:02 +02:00
Andrei Chekun
f7c7877ba6 test.py: add bypassing x_log2_compaction_groups to boost tests
Bypassing argument to pytest->boost that was missing.
2025-07-11 12:30:09 +02:00
Andrei Chekun
71b875c932 test.py: add bypassing random seed to boost tests
Bypassing argument to pytest->boost that was missing.

Fixes: https://github.com/scylladb/scylladb/issues/24927
2025-07-11 12:30:08 +02:00
Andrei Chekun
ae6dc46046 test.py: skip cleaning artifacts when -s provided
Skip removing any artifacts when -s provided between test.py invocation.
Logs from the previous run will be overridden if tests were executed one
more time. Fox example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if tests are passed
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run
2025-07-07 15:42:11 +02:00
Andrei Chekun
d81820f529 test.py: move deleting directory to prepare_dir
Instead of explicitly call removing directory move it to prepare_dir
method. If the passed pattern is '*' than directory will be deleted, in
other casses only files found by pattern
2025-07-04 13:39:42 +02:00
Pavel Emelyanov
4d4406c5bc Merge 'test.py: dtest: port next_gating tests from auth_test.py' from Evgeniy Naydanov
Copy `auth_test.py` from scylla-dtest test suite, remove all not next_gating tests from it, and make it works with `test.py`

As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker.

Remove `test_permissions_caching` test because it's too flaky when running using test.py

Also, make few time execution optimizations:
  - remove redundant `time.sleep(10)`
  - use smaller timeouts for CQL sessions

Enable the test in `suite.yaml` (run in dev mode only.)

Additional modifications to test.py/dtest shim code:

- Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair.
- Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`.
- Copy generate_cluster_topology() function from tools/cluster_topology.py module.
- Add support for `bootstrap` parameter for `new_node()` function.
- Rework `wait_for_any_log()` function.

Closes scylladb/scylladb#24648

* github.com:scylladb/scylladb:
  test.py: dtest: make auth_test.py run using test.py
  test.py: dtest: rework wait_for_any_log()
  test.py: dtest: add support for bootstrap parameter for new_node
  test.py: dtest: add generate_cluster_topology() function
  test.py: dtest: add ScyllaNode.set_configuration_options() method
  test.py: pylib/manager_client: support batch config changes
  test.py: dtest: copy unmodified auth_test.py
  test.py: dtest: add missed markers to pytest.ini
2025-07-04 10:51:52 +03:00
Patryk Jędrzejczak
8d925b5ab4 test: increase the default timeout of graceful shutdown
Multiple tests are currently flaky due to graceful shutdown
timing out when flushing tables takes more than a minute. We still
don't understand why flushing is sometimes so slow, but we suspect
it is an issue with new machines spider9 and spider11 that CI runs
on. All observed failures happened on these machines, and most of
them on spider9.

In this commit, we increase the timeout of graceful shutdown as
a temporary workaround to improve CI stability. When we get to
the bottom of the issue and fix it, we will revert this change.

Ref #12028

It's a temporary workaround to improve CI stability, we don't
have to backport it.

Closes scylladb/scylladb#24802
2025-07-04 10:43:38 +03:00
Pavel Emelyanov
fa0077fb77 Merge 'S3 chunked download source bug fixes' from Ernest Zaslavsky
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle

No need to backport anything since this class in not used yet

Closes scylladb/scylladb#24657

* github.com:scylladb/scylladb:
  s3_test: Add s3_client test for non-retryable error handling
  s3_test: Add trace logging for default_retry_strategy
  s3_client: Fix edge case when the range is exhausted
  s3_client: Fix indentation in try..catch block
  s3_client: Stop retries in chunked download source
  s3_client: Enhance test coverage for retry logic
  s3_client: Add test for Content-Range fix
  s3_client: Fix missing negation
  s3_client: Refine logging
  s3_client: Improve logging placement for current_range output
2025-07-02 14:45:10 +03:00
Konstantin Osipov
37fc4edeb5 test.py: add a way to provide pytest arguments via test.py
Now that we use a single pytest.ini for all tests, different
developer preferences collide. There should be an easy way to override
pytest.ini defaults from the command line.

Fixes https://github.com/scylladb/scylladb/issues/21800

Closes scylladb/scylladb#24573
2025-07-02 12:20:43 +03:00
Ernest Zaslavsky
c75acd274c s3_client: Enhance test coverage for retry logic
Extend the S3 proxy to support error injection when the client
makes multiple requests to the same resource—useful for testing
retry behavior and failure handling.
2025-07-01 18:45:17 +03:00
Tomasz Grabiec
97679002ee Merge 'Co-locate tablets of different tables' from Michael Litvak
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.

main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.

Fixes https://github.com/scylladb/scylladb/issues/17043

backport is not needed. this is preliminary work for support of MVs and CDC with tablets.

Closes scylladb/scylladb#22906

* github.com:scylladb/scylladb:
  tablets: validate no clustering row mutations on co-located tables
  raft_group0_client: extend validate_change to mixed_change type
  docs: topology-over-raft: document co-located tables
  tablet-mon.py: visual indication for co-located tablets
  tablet-mon.py: handle co-located tablets
  test/boost/view_schema_test.cc: fix race in wait_until_built
  boost/tablets_test: test load balancing and resize of co-located tablets
  test/tablets: test tablets colocation
  tablets: co-locate view tablets with base when the partition keys match
  test/pylib/tablets: common get_tablet_count api
  test_mv_tablets: use get_tablet_replicas from common tablets api
  test/pylib/tablets: fix test api to read tablet replicas from base table
  tablets: allocator: create co-located tables in a single operation
  alternator: prepare all new tables in a single announcement
  migration_manager: add notification for creating multiple tables
  tablets: read_tablet_transition_stage: read from base table
  storage service: allow repair request only on base tables
  tablets: keyspace_rf_change: apply on base table
  storage service: generate tablet migration updates on base tables
  tablets: replace all_tables method
  tablets: split when all co-located tablets are ready
  tablets: load balancer: sizing plan for table groups
  tablets: load balancer: handle co-located tablets
  tablets: allocate co-located tablets
  tablets: handle migration of co-located tablets
  storage service: add repair colocated tablets rpc
  tablets: save and read tablet metadata of co-located tables
  tablets: represent co-located tables in tablet metadata
  tablets: add base_table column to system.tablets
  docs: update system.tablets schema
2025-07-01 16:02:30 +02:00
Michael Litvak
e01aae7871 test/pylib/tablets: common get_tablet_count api
Introduce a common get_tablet_count test api instead of it being
duplicated in few tests, and fix it to read the tablet count from the
base table.
2025-07-01 13:20:19 +03:00
Michael Litvak
6bfb82844f test/pylib/tablets: fix test api to read tablet replicas from base table
When reading tablet replicas from system.tablets, we need to refer to
the base table partition, if any.

We fix and simplify the test api for reading tablet replicas to read
from the base table.
2025-07-01 13:20:19 +03:00
Evgeniy Naydanov
e30e2345b7 test.py: dtest: rework wait_for_any_log()
Make `wait_for_any_log()` function to work closer to the original
dtest's version: use `ScyllaLogFile.grep()` method instead of
the usage of `ScyllaNode.wait_log_for()` with a small timeout to
have at least one try to find.

Also, add `max_count` argument to `.grep()` method for the
optimization purpose.
2025-06-30 10:16:36 +00:00
Evgeniy Naydanov
a1ce3aed44 test.py: pylib/manager_client: support batch config changes
Modify ManagerClient.server_update_config() method to change
multiple config options in one call in addition to one `key: value`
pair.  All internal machinery converted to get a values dict as a
parameter.  Type hints were adjusted too.
2025-06-30 10:16:36 +00:00
Andrei Chekun
c6c3e9f492 test.py: use unique hostname for Minio
To avoid situation that port is occupied on localhost, use unique
hostname for Minio
2025-06-30 12:03:06 +02:00
Pavel Emelyanov
4c0154f156 Merge 'test.py: enhance allure reporting' from Andrei Chekun
Add run ID for process output file to be not overwritten in the next case: first run failed, second passed. They are using the same name, so the second run will overwrite and delete the file. This will help to investigate in case of C++ test fails
Add attaching Scylla log files to allure report in case test failed. This is an alternative for link in JUnit report that exists in CI. That change will help to investigate the cluster tests fails. Example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).

Backport is not needed, this is only framework enhancements

Closes scylladb/scylladb#24677

* github.com:scylladb/scylladb:
  test.py: Attach node logs in allure report in case of fail
  test.py: Add run id to the boost output file
2025-06-27 16:22:03 +03:00
Andrei Chekun
2c726c5074 test.py: Attach node logs in allure report in case of fail
Currently, allure report have no nodes logs in case of fail, this will
allow to view the logs in one place without going anywhere else.
2025-06-26 15:37:33 +02:00
Andrei Chekun
156e7d2e7a test.py: Add run id to the boost output file
To avoid overwriting the output tests adding the run id to it.
Previously, when first repeat failed and the second passes, because the
are using the same name for the output, it will be overwritten and
deleted since the second repeat passed
2025-06-26 12:51:15 +02:00