The test hardcoded the expected number of coordinator elections
(2, 3, 4, 5) for each phase. If a prior phase triggered an extra
election, subsequent phases would wait for a count that was already
reached or would never match.
Fix by reading the current election count before each operation and
expecting exactly one more, making each phase independent of prior
history.
Also add wait_for_no_pending_topology_transition() calls after each
coordinator election to ensure the topology state machine has fully
settled before proceeding with restarts and further operations.
Decrease the failure detector timeout (failure_detector_timeout_in_ms)
to 2000 ms on all test nodes so that coordinator crashes are detected
faster, reducing test wallclock time and timeout-related flakiness.
Enable raft_topology=trace logging on all test nodes to aid
post-failure diagnosis. Add diagnostic logging in
wait_new_coordinator_elected().
Fixes: SCYLLADB-1089
Closesscylladb/scylladb#29284
The test_incremental_repair_race_window_promotes_unrepaired_data test
was still flaky because:
1. Only coordinator changes TO servers[1] were detected, but ANY
coordinator change can trigger a residual re-repair that flushes
memtables on all replicas and marks post-repair data as repaired.
2. Even without a coordinator change, the topology coordinator can
initiate a residual re-repair when it sees tablets stuck in the
repair stage after the servers[1] restart. This re-repair
contaminates the repaired set with post-repair data, masking the
compaction-merge bug the test detects.
Fix by:
- Broadening the coordinator check from == servers[1] to != coord
- Adding re-repair detection (grep for 'Initiating tablet repair
host=') at three points: post-restart, during the compaction poll,
and after injection release
- On retry, creating a completely fresh keyspace+table via
_setup_table_for_race_window() so the new attempt starts with
clean tablet metadata uncontaminated by prior re-repairs
Fixes: SCYLLADB-1478
Move the keyspace+table setup logic for
test_incremental_repair_race_window_promotes_unrepaired_data into a
dedicated helper function _setup_table_for_race_window(). The helper
creates a fresh keyspace (unique name via create_new_test_keyspace),
the table, configures STCS min_threshold=2, inserts baseline keys,
runs repair 1, inserts keys for repair 2, and flushes.
This is a pure refactor with no behavioral change: the test function
now calls the helper once instead of inlining the setup. The
extraction enables a subsequent commit to call the helper again on
retry when a leadership transfer is detected.
Add two test cases for the new loading_cache::insert() method:
* test_loading_cache_insert verifies that insert() populates the cache
and invokes the loader exactly once per key when caching is enabled.
* test_loading_cache_insert_caching_disabled is a regression test for
SCYLLADB-1699: when the cache is constructed with expiry == 0
(caching disabled), insert() must be a no-op rather than asserting
in loading_cache::get_ptr() via caching_enabled(). The loader must
not be invoked and the cache must remain empty.
Refs SCYLLADB-1699
The only test requires SCYLLA_ENABLE_ERROR_INJECTION. In modes without it
(e.g. release) the suite was empty, so pytest exited with code 5
("no tests collected") and CI failed. Add a no-op case in that branch
so collection always yields at least one test.
Deterministic reproducer using an error injection point placed in
table_helper::insert() between cache_table_info() and execute(). The
test parks fiber A at the injection, drops the target table (evicting
the prepared_statements_cache entry), runs fiber B which nulls
_insert_stmt, then releases fiber A. Without the fix this crashes in
execute(); with the fix fiber A holds a local strong ref and proceeds.
Uses the new waiters() API to synchronize with fiber A's entry into
the injection.
- Revert the previous "test.py: fix test collection bug" commit (92c09d10) which worked around broken deduplication by filtering items without `BUILD_MODE` in `pytest_collection_modifyitems`. This approach masked the root cause and is superseded by the proper fixes below.
- Backport pytest 9.0.3's argument normalization algorithm into `test.py` to work around broken deduplication in pytest 8.3.5 ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083)). Duplicate or subsumed test paths (e.g. `test/cql` and `test/cql/lua_test.cql`) are collapsed before invoking pytest. Revert when upgrading to pytest 9.x.
- Return a `DisabledFile` collector instead of an empty list in `pytest_collect_file` when all modes are disabled for a file, fixing a bug where subsequent files would not get their stash items set (`REPEATING_FILES`). Restructure `pytest_collect_file` to use a walrus operator (`if repeats := ...`) with a single `remove(file_path)` and `return collectors` at the end, eliminating the early return.
- Add `--keep-duplicates` CLI argument to bypass deduplication and forward to pytest.
- Move `RUN_ID` assignment from `pytest_collect_file` to `modify_pytest_item`. A shared `run_ids` cache (`defaultdict[tuple[str, str], count]`) is created in `pytest_collection_modifyitems` and passed to `modify_pytest_item`, keyed by `(build_mode, nodeid)` so each mode gets independent counters. This ensures unique run IDs even when `--keep-duplicates` causes the same file to be collected multiple times.
- Fix `--repeat` option default from string `"1"` to int `1` — argparse only applies `type=` to CLI-parsed values, not defaults.
pytest normally deduplicates overlapping test arguments — e.g. `test/cql test/cql/lua_test.cql` collects `lua_test.cql` only once. The original `test.py` never performed this deduplication, and the pytest version in the toolchain image (8.3.5) has a bug that breaks it ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083).)
Since we are moving to bare pytest, `test.py` should match pytest's default behavior: deduplicate. Because we cannot easily upgrade pytest, commit 2 backports the deduplication logic from pytest 9.0.3.
To match pytest's interface, `--keep-duplicates` is added as an opt-out. This lets a user intentionally run overlapping paths — e.g. `./test.py test/blah test/blah/test_foo.py --keep-duplicates` runs `test_foo.py` twice. The flag is forwarded to pytest and also skips the backported deduplication in `test.py`.
- Revert 92c09d10 which filtered items without `BUILD_MODE` in `pytest_collection_modifyitems` and added an early return in `CppFile.collect()`. This workaround is superseded by the proper deduplication and `DisabledFile` fixes.
- Add `_CollectionArgument` dataclass (`order=True`, `__contains__` for subsumption) and `_deduplicate_test_args()` function, adapted from pytest 9.0.3. Marked with a TODO to remove once we update to pytest 9.x.
- Call `_deduplicate_test_args()` on `options.name` before passing to pytest.
- Add `DisabledFile(pytest.File)` that skips collection with an informative message instead of returning an empty list.
- Restructure `pytest_collect_file` to use walrus operator: `if repeats := ...:` / `else:` — single `remove(file_path)` at end, no early return.
- Add `--keep-duplicates` argument that skips deduplication and is forwarded to pytest.
- Create a shared `run_ids` cache in `pytest_collection_modifyitems` and pass it to `modify_pytest_item`, which assigns unique sequential RUN_IDs via `itertools.count`. The cache is keyed by `(build_mode, nodeid)` so each mode gets independent counters.
- Remove `RUN_ID` from `_STASH_KEYS_TO_COPY` — it is no longer set on collectors.
- Remove `CppFile.run_id` cached_property. `CppTestCase` now reads `RUN_ID` from its own item stash.
- Fix `--repeat` option default from `"1"` to `1` and drop redundant `int()` cast.
Closes SCYLLADB-1730
Closesscylladb/scylladb#29665
* github.com:scylladb/scylladb:
test: add --keep-duplicates and assign RUN_ID via shared cache
test/pylib/runner: fix disabled file collection
test.py: deduplicate CLI test arguments before passing to pytest
Revert "test.py: fix test collection bug"
lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock
on every shard. It used invoke_on_all(), which dispatches lock
acquisitions to all shards in parallel via parallel_for_each +
smp::submit_to.
When two fibers call lock_tables_metadata() concurrently, this can
deadlock. parallel_for_each starts all iterations unconditionally:
even when the local shard's lock attempt blocks (because the other
fiber already holds it), SMP messages are still sent to remote shards.
Both fibers' lock-acquisition messages land in the per-shard SMP
queues. The SMP queue itself is FIFO, but process_incoming() drains
it and schedules each item as a reactor task via add_task(), which —
in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE — shuffles
each newly added task against all pending tasks in the same scheduling
group's reactor task queue. This means fiber A's lock acquisition can
be reordered past fiber B's (and past unrelated tasks) on a given shard.
If fiber A wins the lock on shard X while fiber B wins on shard Y, this
creates a classic cross-shard lock-ordering deadlock (circular wait).
In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor
task queue is FIFO. Still, even in release builds, the SMP queues can
reorder messages even, so the deadlock is still possible, even if it's
much less likely. In debug and sanitize builds, the task-queue shuffle
makes the deadlock very likely whenever both fibers' lock-acquisition
tasks are pending simultaneously in the reactor task queue on any shard.
This deadlock was exposed by ce00d61917 ("db: implement large_data
virtual tables with feature flag gating", merged as 88a8324e68),
which introduced legacy_drop_table_on_all_shards as a second caller
of lock_tables_metadata(). When LARGE_DATA_VIRTUAL_TABLES is enabled
during topology_state_load (via feature_service::enable), two fibers
can race:
1. activate_large_data_virtual_tables() — calls
legacy_drop_table_on_all_shards() which calls
lock_tables_metadata() synchronously via .get()
2. reload_schema_in_bg() — fires as a background fiber from
TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches
schema_applier::commit() which also calls lock_tables_metadata()
If both reach lock_tables_metadata() while the lock is free on all
shards, the parallel acquisition creates the deadlock opportunity.
The deadlock blocks topology_state_load() from completing, which
prevents the bootstrapping node from finishing its topology state
transitions. The coordinator's topology coordinator then waits for
the node to reach the expected state, but the node is stuck, so
eventually the read_barrier times out after 300 seconds.
Fix by acquiring the shard 0 lock first before attempting to
acquire any other lock. Whichever fiber wins shard 0 is
guaranteed to acquire all remaining shards before the other fiber
can proceed past shard 0, eliminating the circular-wait condition.
Tested manually with 2 approaches:
1. causing different shard locks to be acquired by different
lock_tables_metadata() calls by adding different sleeps depending
on the lock_tables_metadata() call and target shard - this reproduced
the issue consistently
2. matching the time point at which both fibers reach lock_tables_metadata()
adding a single sleep to one of the fibers - this heavily depends on
the machine so we can't create a universal reproducer this way, but
it did result in the observed failure on my machine after finding the
right sleep time
Also added a unit test for concurrent lock_tables_metadata() calls.
Fixes: SCYLLADB-1694
Fixes: SCYLLADB-1644
Fixes: SCYLLADB-1684
Closesscylladb/scylladb#29678
This series addresses three problems in the audit startup/shutdown
sequence:
1. [BUG] Shutdown SIGABRT. During graceful shutdown, deferred stops run in reverse order of construction. With the audit service constructed after the maintenance socket, audit was destroyed first, and in-flight queries on the maintenance socket could hit the destroyed audit service (assertion failure in sharded::local()).
2. [BUG] Startup audit bypass. The maintenance socket opened before audit storage was initialized, allowing queries (e.g. creating a superuser) to bypass auditing in that window.
3. [PROBLEM] Blocks SCYLLADB-1430. The existing order prevents audit configuration from being driven by group0 state, because audit started before group0.
The series is organized as: a test-helper refactor, a test for the audited maintenance-socket flow, a startup-phase split, the construction-order fix and its shutdown-race test, and finally the storage-before-socket fix and its startup-window test.
Fixes SCYLLADB-1615
No backport, bugs don't seem severe enough to justify backporting.
Closesscylladb/scylladb#29539
* github.com:scylladb/scylladb:
audit: assert storage ordering invariants at runtime
audit: start maintenance socket after audit storage
audit: move audit construction before maintenance socket
audit: split startup into construction and storage phases
test: audit: verify maintenance socket operations are audited
test: audit: parameterize source address in audit assertions
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.
The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.
Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697Closesscylladb/scylladb#29620
`ScyllaCluster.nodelist()` creates new `ScyllaNode` objects on every call,
so per-node state set via `set_smp()`, `set_log_level()`, and
`_adjust_smp_and_memory()` was lost. This meant `set_smp()` had no effect
when `cluster.start()` was called after it, since `start_nodes()` calls
`nodelist()` internally which creates fresh nodes with default values.
- Add debug logging for smp/memory in ScyllaNode
- Store per-node settings (smp, memory, log levels) in a
`ScyllaCluster._node_resources` dict keyed by server_id, so they survive
`nodelist()` reconstruction. `ScyllaNode` restores its state from this dict
on construction and saves it back whenever `set_smp()`, `set_log_level()`,
or `_adjust_smp_and_memory()` modifies it.
- Add a reproducer test verifying `set_smp()` takes effect on restart
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1629
--
No backport needed: this only fixes dtest infrastructure, no production code
is affected.
Closesscylladb/scylladb#29549
* github.com:scylladb/scylladb:
test/cluster/dtest: add test for node.set_smp() persistence
test/cluster/dtest: cache ScyllaNode instances in ScyllaCluster
test/cluster/dtest/ccmlib/scylla_node: add debug logging
Add --keep-duplicates CLI argument to bypass deduplication and forward
to pytest, allowing duplicate test file arguments to be collected
multiple times.
Move RUN_ID assignment from pytest_collect_file to modify_pytest_item.
All File collectors for the same source file share a single run_ids
dict (via RUN_ID_CACHE stash key), so items from duplicate collection
arguments (e.g. with --keep-duplicates) automatically get unique IDs.
Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID
from its own item stash, which is set during modify_pytest_item.
Fix --repeat option default from string "1" to int 1 — argparse only
applies type= to CLI-parsed values, not defaults.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Return a DisabledFile collector instead of an empty list when all modes
are disabled for a file. Returning an empty list caused subsequent
files to not get their stash items set because file_path was never
removed from REPEATING_FILES.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
The table-based audit backend needs Raft to create its keyspace,
but the audit service must exist earlier so that CQL paths don't
silently skip auditing.
Split startup into two phases: construction and storage
initialization. Queries arriving between the two phases are
logged as errors.
This is a refactoring commit and the split sections will be
moved later in this patch series.
Refs SCYLLADB-1615
User creation via the maintenance socket should produce audit
entries, as this is the recommended flow for creating the
initial superuser when default credentials are disabled.
The test is parametrized by audit backend (table and syslog).
The maintenance socket source address is "::" because Seastar
returns a zero-initialised in6_addr for AF_UNIX sockets.
Test time in dev: 0.6s
Refs SCYLLADB-1615
Some tests in test_tablets.py read system_schema.keyspaces from an arbitrary node that may not have applied the latest schema change yet. Pin the read to a specific node and issue a read barrier before querying, ensuring the node has up-to-date data.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1700
Test fix; no backport
Closesscylladb/scylladb#29655
* github.com:scylladb/scylladb:
test: fix flaky rack list conversion tests by using read barrier
test: fix flaky test_enforce_rack_list_option by using read barrier
Parallelize SSTable creation using parallel_for_each. The file
count is made a parameter with a default of 64, allowing future
S3/GCS variants to use a smaller count if needed.
Parallelize SSTable creation using parallel_for_each and reduce
the SSTable count from 256 to 64 for S3/GCS variants. The local
test variant retains the original 256 count.
Parallelize SSTable creation across all sub-tests using
parallel_for_each and reduce the SSTable count from 256 to 64 for
S3/GCS variants.
Re-enable the S3 test variant that was previously disabled due to
taking 4+ minutes. With parallel creation and reduced count, the
test now completes in a reasonable time.
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations.
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations that depend on SSTable order.
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
Raise log levels for s3 and gcp_storage from debug to trace, and add
trace-level logging for http and default_http_retry_strategy modules.
This provides better visibility into storage backend interactions
when debugging slow or failing compaction tests on remote storage.
The original make_memtable used seastar::thread::yield() for
preemption, which required all callers to run inside a
seastar::thread context. This prevented the utilities from being
used directly in coroutines or parallel_for_each lambdas.
Make the primary functions — make_memtable, make_sstable_containing,
and verify_mutation — return future<> directly. Callers now .get()
explicitly when in seastar::thread context, or co_await when in
a coroutine.
make_memtable now uses coroutine::maybe_yield() instead of
seastar::thread::yield(). verify_mutation is converted to
coroutines as well.
Requested in:
https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
test_exception_safety_of_update_from_memtable called make_memtable
inside the row_cache::external_updater callback. external_updater
runs as a synchronous execute() call that must not yield, but
make_memtable calls seastar::thread::yield() every 10th mutation.
The bug was latent because the test only inserted 5 mutations, so
the yield was never reached. Move the call before the callback.
Prerequisite for the next patch, which changes make_memtable to
call make_memtable_async().get() -- that would yield on every
mutation via coroutine::maybe_yield(), making this bug visible.
Increase max_connections from the default to 32 for the S3 endpoint
used in tests. This allows more concurrent HTTP connections to the S3
backend, which is needed to benefit from parallel SSTable creation
that will be introduced in subsequent commits.
Commit 2b7aa32 (topology_coordinator: Refresh load stats after
table is created or altered) registered topology_coordinator as a
schema change listener and added on_create_column_family which
fire-and-forgets _tablet_load_stats_refresh.trigger(). The
triggered task runs on the gossip scheduling group via
with_scheduling_group and accesses the topology_coordinator via
'this'.
stop() unregisters the listener but does not wait for any
in-flight refresh task. If a notification fires between
_tablet_load_stats_refresh.join() in run() and unregister_listener
in stop(), the scheduled task can outlive the topology_coordinator
and access freed memory after run_topology_coordinator's coroutine
frame is destroyed.
Wait for the refresh to complete in stop() after unregistering the
listener, ensuring no task can fire after destruction.
Fixes SCYLLADB-1728
Backport to 2026.1 and 2026.2, because the issue was introduced in 2b7aa32Closesscylladb/scylladb#29653
* https://github.com/scylladb/scylladb:
test: tablet_stats: reproduce shutdown refresh race
topology_coordinator: join tablet load stats refresh in stop()
Add a test that reproduces SCYLLADB-1629: set_smp() had no effect
because nodelist() created new ScyllaNode objects on every call,
losing the _smp_set_during_test value. The test fails without the
fix in the previous patch.
ScyllaCluster.nodelist() was creating new ScyllaNode objects on every
call, so per-node state set via set_smp(), set_log_level(), and
_adjust_smp_and_memory() was lost between calls.
Fix by caching ScyllaNode instances in a list populated by
_add_nodes() using the list returned by servers_add() in populate().
Nodes are assigned monotonically increasing names (node1, node2, ...).
nodelist() simply returns the cached list.
The LDAP role manager's `_cache_pruner` background fiber periodically calls cache::reload_all_permissions(). Two races cause it to hit SCYLLA_ASSERT(_permission_loader):
- Cross-shard race: The pruner `used _cache.container().invoke_on_all()` to reload permissions on every shard. Since both `service::start()` and `sharded<service>::stop()` execute per-shard in parallel, the pruner on one shard could call reload_all_permissions() on another shard before that shard set its loader (startup) or after it cleared its loader (shutdown). Each shard runs its own pruner instance, so reloading locally is sufficient — this also removes redundant N² reload calls.
- Intra-shard race: `service::stop()` cleared the permission loader and stopped the role manager concurrently (via when_all_succeed). A mid-reload pruner could yield and then call the now-null loader. Fixed by stopping the role manager first so the pruner is fully drained before the loader is cleared.
Fixes SCYLLADB-1679
Backport to 2026.2, introduced in 7eedf50c12Closesscylladb/scylladb#29605
* github.com:scylladb/scylladb:
auth: make shutdown the exact reverse of startup
test: ldap: add test for pruner crash during shutdown
auth: start authorizer and set permission loader before role manager
auth: stop role manager before clearing permission loader
auth: reload LDAP permission cache on local shard only
In certain circumstances current way of collecting can be error-prone. Collection can stop when the first file is skipped in the mode leaving the rest of the files in CLI not collected.
Another issue that if the file specified twice, with directory and file explicitly, it will produce incorrect CppFile in the stash causing KeyError.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
No backport, test framework bug fix only.
Closesscylladb/scylladb#29634
* github.com:scylladb/scylladb:
test.py: fix framework test
test.py: fix test collection bug
The coordinator can receive a schema-change notification after run()
finishes but before stop() unregisters listeners. The test pins that
window with error injections and verifies stop() waits for the refresh
instead of letting it outlive the coordinator.
Test time in dev: 9.51s
Refs SCYLLADB-1728
There is a small race window where Raft leadership could transfer back
to servers[1] between the ensure_group0_leader_on() check and the
actual restart. If this happens, the new coordinator re-initiates
repair and masks the compaction-merge bug.
Extract the core test logic into _do_race_window_promotes_unrepaired_data()
which directly checks get_topology_coordinator() after restart and raises
_LeadershipTransferred if servers[1] became coordinator. The test
function calls this helper in a retry loop (up to 5 attempts).
Refs: SCYLLADB-1478
The test_incremental_repair_race_window_promotes_unrepaired_data test
was flaky because it hardcodes servers[1] as the restart target but did
not ensure servers[1] was NOT the topology coordinator.
When servers[1] happened to be the Raft group0 leader (topology
coordinator), restarting it killed the leader, forced a new election,
and the new coordinator re-initiated tablet repair. This re-repair
flushes memtables on all replicas via take_storage_snapshot() and marks
the resulting sstables as repaired -- causing post-repair keys to appear
in repaired sstables on servers[0] and servers[2]. The test then hit
the wrong assertion (servers[0]/[2] contaminated).
Fix: before starting the repair, check whether servers[1] is the
topology coordinator. If so, move leadership to another server via
ensure_group0_leader_on() so that restarting servers[1] only kills a
follower -- which does not trigger an election or coordinator change.
Reproducibility was confirmed by forcing leadership to servers[1] via
ensure_group0_leader_on() and observing deterministic failure with all
three servers showing post-repair keys in repaired sstables (confirming
the re-repair scenario), then verifying the fix passes reliably.
Fixes: SCYLLADB-1478
test_numeric_rf_to_rack_list_conversion and
test_numeric_rf_to_rack_list_conversion_abort were reading
system_schema.keyspaces from an arbitrary node that may not have
applied the latest schema change yet. Pin the read to a specific node
and issue a read barrier before querying, ensuring the node has
up-to-date data.
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.
Add test_load_balancing_with_dropped_table that simulates the race between
DROP TABLE and the load balancer by capturing a token metadata snapshot
before dropping the table, then passing the stale snapshot to
balance_tablets(). Verifies it completes without aborting and produces no
migrations for the dropped table.
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused with Alternator tests.
Extend the ServerUpState enum and startup loop to also check Alternator
port readiness when configured. Whenever Alternator port(s) is/are
configured, each is verified if connectable and queryable,
similar to how CQL ports are probed.
Fixes SCYLLADB-1701
Closesscylladb/scylladb#29625
Commit ce00d61917 ("db: implement large_data virtual tables with feature
flag gating") changed these two tests to construct their mutation with
a randomly generated partition key (simple_schema::make_pkey()) instead
of the previously fixed pk "pv", with the comment that this avoids a
"Failed to generate sharding metadata" error.
simple_schema::make_pkey() delegates to tests::generate_partition_key(),
which defaults to key_size{1, 128}, i.e. the partition key length is
uniformly random in [1, 128] bytes. That interacts badly with the fact
that both tests pick thresholds at exact byte boundaries of the MC
sstable row encoding:
- The large-data handler records a row's size as
_data_writer->offset() - current_pos
(sstables/mx/writer.cc: collect_row_stats()), i.e. the number of
bytes the row took on disk.
- For the first clustering row, the body includes a vint-encoded
prev_row_size = pos - _prev_row_start.
- _prev_row_start is captured at the start of the partition
(consume_new_partition()) before the partition key is written to
the data stream, so prev_row_size rolls in the partition key's
serialized length (2-byte prefix + pk bytes) + deletion_time +
static row size.
A random-size partition key therefore perturbs the first clustering
row's encoded size by 1-2 bytes across runs (the vint of prev_row_size
crosses the 128 boundary), flipping the test's byte-exact threshold
comparison. On seed 2104744000 this produced:
critical check row_size_count == expected.size() has failed [3 != 2]
Fix the two byte-exact-sensitive tests by reverting their partition key
to the fixed s.new_mutation("pv") used before ce00d61917. Under smp=1
(which these tests run with, per -c1 in the test invocation) a fixed
key is always shard-local, so no sharding-metadata issue arises here.
The other tests modified by ce00d61917 (test_sstable_log_too_many_rows,
test_sstable_log_too_many_dead_rows, test_sstable_too_many_collection_elements,
test_large_data_records_round_trip, etc.) assert on row/element counts
or use thresholds with enough slack that the partition key size does
not matter, and are left unchanged.
Add an explanatory comment to each fixed site so the pitfall is not
re-introduced by a future refactor.
Verified stable with:
./test.py --mode=dev test/boost/sstable_3_x_test.cc::test_sstable_write_large_row --repeat 100 --max-failures 1
./test.py --mode=dev test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1
./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_row --repeat 100 --max-failures 1
./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1
All four invocations: 100/100 passed.
Fixes: SCYLLADB-1685
Closesscylladb/scylladb#29621
In certain circumstances current way of collecting can be error prone.
Collection can stop when the first file is skipped in the mode leaving
the rest of the files in CLI not collected.
Another issue that if the file specified twice, with directory and file
explicitly, it will produce incorrect CppFile in the stash causing
KeyError.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.
The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.
This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
Drop storage_proxy's own updateable_timeout_config member built from
db::config and take a reference to the shared sharded instance
introduced by the previous patch. Both main and cql_test_env pass
std::ref(timeout_cfg) into storage_proxy::start so each shard's
storage_proxy references its shard-local updateable_timeout_config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build a single sharded updateable_timeout_config from db::config in
both main and cql_test_env, sitting next to sharded<cql_config>.
Subsequent patches migrate storage_proxy, the CQL transport controller
and alternator server from their per-owner updateable_timeout_config
copies to references into this shared instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verify that service::stop() drains the LDAP pruner before
clearing the permission loader. The test installs a slow
permission loader and confirms the pruner is actively
reloading when teardown begins.
Refs SCYLLADB-1679
Pass node_update_backlog explicitly to view_update_generator via its
constructor and start() call. This is plumbing only; no behavior change.
A subsequent patch will use this reference to compute view update
throttling delays without going through database::get_config().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>