Symptom: the rest_api_mock subprocess exits with status 1 during fixture
setup, e.g.:
subprocess.CalledProcessError: Command '[..., 'rest_api_mock.py',
'127.29.88.1', '34093']' returned non-zero exit status 1
Root cause: aiohttp's TCPSite.start() raises OSError(EADDRINUSE) and the
process exits 1. The bind fails because of how the (ip, port) pair is
chosen across modules within one test.py process:
* Each test module leases a 127.x.y.z IP from the host registry. The
registry recycles released IPs, so the same IP is shared across
modules sequentially.
* The original code picked the port via random.randint(10000, 65535).
A previous module on the same IP could have left that port in
TIME_WAIT (or worse, still actively in use) when a later module
happened to pick the same port.
SCYLLADB-1275 (PR 29314) tried to fix this by binding a probe socket to
(ip, 0) to obtain an OS-assigned free port, closing the probe, then
launching the mock server which would bind to that port. Two issues
remained:
1. TOCTOU: between probe close and mock-server bind, any other process
on the host could grab the just-freed port.
2. TIME_WAIT could still bite if the host registry recycled an IP and
the OS reused the same port number for the probe.
Fix: drop port discovery entirely. Use a fixed port (12345, matching the
unshare-namespace path already in this fixture) on the unique IP from
the host registry. Because IPs are unique per test module within one
test.py process, the (ip, 12345) pair is unique to each module, so no
port-collision dance is needed.
reuse_address=True on TCPSite handles the residual TIME_WAIT case when
the host registry recycles an IP within the same test.py process and
the previous mock server's socket has not finished TIME_WAIT yet.
reuse_port=True is dropped, as it was only useful while attempting to
have multiple processes share a single port.
This mirrors the design used in test/cqlpy/run.py: pick a unique IP,
keep the port fixed.
Fixes: SCYLLADB-1718
Closesscylladb/scylladb#29656
Migrate runtime pytest.skip() calls across 34 files to use the typed
skip_env() wrapper from test.pylib.skip_types.
These sites skip at runtime because a required feature, config option,
library version, build mode, or runtime topology is not available.
Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env()
already raises internally, so the explicit raise was incorrect.
Each file gains one new import:
from test.pylib.skip_types import skip_env
With migration to pyest this fixture is useless. Removing and setting
the session to the module for the most of the tests.
Add dynamic_scope function to support running alternator fixtures in
session scope, while Test and TestSuite are not deleted. This is for
migration period, later on this function should be deleted.
Replace the random port selection with an OS-assigned port. We open
a temporary TCP socket, bind it to (ip, 0) with SO_REUSEADDR, read back
the port number the OS selected, then close the socket before launching
rest_api_mock.py.
Add reuse_address=True and reuse_port=True to TCPSite in rest_api_mock.py
so the server itself can also reclaim a TIME_WAIT port if needed.
Fixes: SCYLLADB-1275
Closesscylladb/scylladb#29314
This fixtures starts the mock server and immediately connects to it to
setup the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exception.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't setup the routes yet. Add a retry for http
404 to handle this.
Fixes: SCYLLADB-966
Closesscylladb/scylladb#29003
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
Add explicit default values to pytest command line options to prevent
issues when running tests with pytest's parallel execution where
options are not present on upper conftest, so they're just not set at all.
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.
Closesscylladb/scylladb#28490
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.
Closesscylladb/scylladb#28348
This problem and its fix was suggested by copilot, I'm just writing the
cover letter.
test/nodetool/test_status.py has the silly statement tokens == "?" which
has no effect. Looking around the code suggested to me (and also to
Copilot, nice) that the correct intent was assert tokens == "?" and not,
say, tokens = "?".
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27659
There are two test with name test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and and two test_repair_keyspace
in test/nodetool/test_repair.py. Due to that one of each pair is ignored.
Rename the tests so that they are unique.
Fixes: https://github.com/scylladb/scylladb/issues/27701.
Closesscylladb/scylladb#27720
This reverts commit 8192f45e84.
The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.
The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported version since automatic cleanup behaviour as it is now may create unexpected by the operator load during cluster resizing.
Closesscylladb/scylladb#26868
* https://github.com/scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
So far it was not allowed to pass a scope when using
the primary_replica_only option, this patch enables
it because the concepts are now combined so that:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=node is not allowed
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and set of tests to ensure correct behavior and error handling.
Fixes#19060
Backport is not required, it is a new feature
Closesscylladb/scylladb#26579
* github.com:scylladb/scylladb:
sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
test/nodetool: add scrub drop-unfixable-sstables option testcase
scrub: add support for dropping unfixable sstables in segregate mode
This patches introduces the test_scrub_drop_unfixable_sstables_option testcase,
which verifies that correct request is generated when the --drop-unfixable-sstables flag is used.
It also validates that an error is thrown if the drop-unfixable-sstables
flag is enabled and mode is not set to SEGREGATE.
This patch introduces test_scrub_drop_unfixable_sstables_option, which test
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.
Fixesscylladb/scylladb#26567Closesscylladb/scylladb#26568
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.
Before:
- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
After:
- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
Fixes#26503Closesscylladb/scylladb#26504
Adds a parameterized test to verify that multiple --key-components arguments
are handled correctly by nodetool's getendpoints command. Ensures the
constructed REST request includes all key_component values in the expected format.
Previously, only the last value of a repeated query parameter was captured,
which could cause inaccurate request matching in tests. This update ensures
that all values are preserved by storing duplicates as lists in the `params` dict.
when cluster repair is run for an entire keyspace, nodetool makes a
repair api request for each table.
if the keyspace contains colocated tables, then the api request for the
colocated tables will fail, because currently scylla doesn't allow making
repair requests for specific colocated tables, but only for base tables.
if the request is to repair an entire keyspace then we can ignore this,
because we will make a repair request for all base tables, and this in
turn will repair also all the colocated tables in the keyspace.
however if specific tables are requested and some of them are colocated
then we should propagate the error to let the user know the request is
invalid.
Refs scylladb/scylladb#24816
The `--incremental-mode` option specifies the incremental repair mode.
Can be 'disabled', 'regular', or 'full'.
'regular': The incremental repair logic is enabled. Unrepaired sstables
will be included for repair. Repaired sstables will be skipped. The
incremental repair states will be updated after repair.
'full': The incremental repair logic is enabled. Both repaired and
unrepaired sstables will be included for repair. The incremental repair
states will be updated after repair.
'disabled': The incremental repair logic is disabled completely. The
incremental repair states, e.g., repaired_at in sstables and
sstables_repaired_at in the system.tablets table, will not be updated
after repair.
When the option is not provided, it defaults to regular.
Fixes#25931Closesscylladb/scylladb#25969
- Add dropquarantinedsstables command to remove quarantined SSTables
- Support both flag-based (--keyspace, --table) and positional arguments
- Allow targeting all keyspaces, specific keyspace, or keyspace with specified tables
Fixesscylladb/scylladb#19061
nodetool repair command repairs only vnode keyspaces. If a user tries
to repair a tablet keyspace, an exception is thrown.
Closesscylladb/scylladb#23660
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#24409Fixes: #24365
This change adds the --scope option to nodetool refresh.
Like in the case of nodetool restore, you can pass either of:
* node - On the local node.
* rack - On the local rack.
* dc - In the datacenter (DC) where the local node lives.
* all (default) - Everywhere across the cluster.
as scope.
The feature is based on the existing load_and_stream paths, so it
requires passing --load-and-stream to the refresh command.
Also, it is not compatible with the --primary-replica-only option.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#23861
Currently, the `system.compaction_history` table miss information like the type of compaction (cleanup, major, resharding, etc), the sstable generations involved (in and out), shard's id the compaction was triggered on and statistics on purged tombstones to be collected during compaction.
The series extends the table with the following columns:
- "compaction_type" (text)
- "shard_id" (int)
- "sstables_in" (list<sstableinfo_type>)
- "sstables_out" (list<sstableinfo_type>)
- "total_tombstone_purge_attempt" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)
with a user defined type `sstableinfo_type` that holds the information about sstable file
- generation (uuid)
- origin (text)
- size (long)
Additional statistics stored in the compaction_history have been incorporated in the API `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command.
No backport is required. It extends the existing compaction history output.
Fixes https://github.com/scylladb/scylladb/issues/3791Closesscylladb/scylladb#21288
* github.com:scylladb/scylladb:
nodetool: Refactor of compactionhistory_operation
nodetool: Add more stats into compactionhistory output
api/compaction_manager: Extend compaction_history api
compaction: Collect tombstone purge stats during compaction
compacting_reader: Extend to accept tombstone purge statistics
mutation_compactor: Collect tombstone purge attempts
compaction_garbage_collector: Extend return type of max_purgeable_fn
compaction: Extend compaction_result to collect more information
system_keyspace: Upgrade compaction_history table
system_keyspace: Create UDT: sstableinfo_type
system_keyspace: Extract compaction_history struct
system_keyspace: Squeeze update_compaction_history parameters
compaction/compaction_manager: update_history accepts compaction_result as rvalue
Incorporate additional statistics stored in the compaction_history
system table. Depending on the requested format type, the output has
different form.
Remove unnecessary duplicated history_entry struct and instead use
extracted db::compaction_history_entry structure.
Running the cql command: select * from system.compaction_history;
prints sstable's generation type as UUID (e.g. 5a5cf800-b617-11ef-a97d-8438c36f0e31),
see generation_type::data_value() which is different than its fmt
format (e.g. 3glx_0srx_1pasg2ksepk902v8dt). Therefore, to unify
the outputs, generation_type is converted to data_value before
it is printed.
Negative load sizes don't make sense, but we've seen a case in
production, where a negative number was returned by ScyllaDB REST API,
so be prepared to handle these too.
Fixes: scylladb/scylladb#24134Closesscylladb/scylladb#24135
Any empty object of the json::json_list type has its internal
_set variable assigned to false which results in such objects
being skipped by the json::json_builder.
Hence, the json returned by the api GET//compaction_manager/compaction_history
does not contain the field `rows_merged` if a cell in the
system.compaction_history table is null or an empty list.
In such cases, executing the command `nodetool compactionhistory`
will result in a crash with the following error message:
`error running operation: rjson::error (JSON assert failed on condition 'false'`
The patch fixes it by checking if the json object contains the
`rows_merged` element before processing. If the element does
not exist, the nodetool will now produce an empty list.
Fixes https://github.com/scylladb/scylladb/issues/23540Closesscylladb/scylladb#23514
Warn about an attempt to repair tablet keysapce with nodetool repair.
A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.
Add path constants to `test` module and use them in different test suites
instead of own dups of the same code:
- TOP_SRC_DIR : ScyllaDB's source code root directory
- TEST_DIR : the directory with test.py tests and libs
- BUILD_DIR : directory with ScyllaDB's build artefacts
Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path
provided using `--tmpdir` CLI argument. Don't use `tmpdir` name because it
mixed up with pytest's built-in fixture and `--tmpdir` option itself.
Change default value for `--tmdir` from `./testlog` to `TOP_SRC_DIR/testlog`
Refactor `ResourceGather*` classes to use path from a `test` object instead of
providing it separately.
Move modes constants to `test` module and remove duplications.
Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to
`pylib.suite.base` to avoid circular imports (with little refactoring to
use `pathlib.Path` instead of `str` as paths.)
Also, in some places refactor to use f-strings for formatting.
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit on the field where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.
To avoid such bugs in the future, replace all usage of GetInt() in
nodetool of GetInt64(), just to be sure.
A reproducer is added to the nodetool netstats crash.
Fixes: scylladb/scylladb#23394Closesscylladb/scylladb#23395
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.
Fixes: #22770Closesscylladb/scylladb#22771
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.
Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.
In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool.
This set implements the missing bits and adds tests for new functionality.
Closesscylladb/scylladb#21877
* github.com:scylladb/scylladb:
nodetool: Implement [gs]etstreamthroughput commands
nodetool: Implement [gs]etcompationthroughput commands
test: Add validation of how IO-updating endpoints work
api: Implement /storage_service/(stream|compaction)_throughput endpoints
api: Disqualify const config reference
api: Implement /storage_service/stream_throughput endpoint
api: Move stream throughput set/get endpoints from storage service block
api: Move set_compaction_throughput_mb_per_sec to config block
util: Include fmt/ranges.h in config_file.hh
They exist in the original documentation, but are not yet implemented.
Now it's possible to do it.
It slightly more complex that its compaction counterpart in a sense than
get method reports megabits/s by default and has an option to convert to
MiBs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>