Fix two format string bugs:
- gcp/object_storage.cc: _session_path was passed but the format
string had empty parentheses () instead of ({}), so the session
path was silently dropped from the debug output.
- s3/client.cc: part_number was passed as an argument but had no {}
placeholder. The upload_id ended up in the etag slot and was
silently dropped. Add {} for all three values.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The FSM state was passed as an argument but the format string had
empty parentheses () instead of ({}), causing the FSM state to be
silently dropped from the output.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The collection type name was passed as an argument but the format
string only had a trailing colon without a {} placeholder, so the
type name was silently dropped from the error message.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The values count was passed as an argument but had no {} placeholder,
so it was silently dropped. The analogous BETWEEN check on the line
above correctly uses {} -- apply the same pattern here.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix three format string bugs:
- partition_reversing_data_source.cc: _row_start was passed as an
argument but had no {} placeholder in the invariant error message.
Add {} for all three values to show the full diagnostic.
- reader.cc: two "Invalid boundary type" error messages passed the
type value as an argument but had no {} placeholder, so the actual
invalid type was never shown.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Four logger calls used %s (printf-style) instead of {} (fmtlib-style),
causing __func__ to be silently ignored and the literal text "%s" to
appear in the log output. The same file already uses {} correctly in
the on_create_function and on_create_aggregate handlers.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix two format string bugs where arguments were silently dropped:
- prepare_expr.cc: the bad argument to count() was passed but had no
{} placeholder, so users never saw what was actually passed.
- statement_restrictions.cc: the unsupported multi-column relation was
passed but the trailing colon had no {} placeholder.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix six format string bugs where arguments were silently dropped:
- heat_load_balance.cc: pp value was passed but had no {} placeholder.
- commitlog_replayer.cc: column_family_id was passed but table= had
no {} placeholder.
- view_update_generator.cc: _sstables_with_tables.size() was passed
but had no {} placeholder.
- view_building_worker.cc: exception pointer was passed but the
trailing colon had no {} placeholder.
- row_locking.cc: partition key and clustering key were passed in
error messages but had no {} placeholders.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix four format string bugs:
- raft_group0.cc: the exception from sleep_and_abort was passed as an
argument but had no {} placeholder, so it was silently dropped.
- storage_service.cc: loading topology trace was missing a placeholder
for the cleanup field (9 args but only 8 placeholders).
- storage_service.cc: two join-rejection warnings had a spurious comma
after the first string literal, breaking C++ string concatenation.
This caused the continuation string to be treated as a separate
format argument instead of being part of the format string, and
params.host_id was silently dropped.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The log message for tablet cleanup invalidation was missing a {}
placeholder for the table name (cf_name), causing it to be silently
dropped from the output. Add {}.{} to show both keyspace and table
name, consistent with the convention used elsewhere in the file.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The function database::create_local_system_table calls
get_tables_metadata().hold_write_lock(), but does not co_await the
returned future. Effectively, this code does not guarantee mutual
exclusion because it does not wait for the lock to be acquired and does
not guarantee that the lock is held long enough.
Fix this by adding the co_await that was missing.
Found by manual inspection. This code is not known to have caused any
problems so far, but it's clearly wrong - hence the fix.
Closesscylladb/scylladb#29806
Drop creation of `service_levels` and `cdc_generation_descriptions_v2` table creation code since they are no longer needed. Old clusters will still have it because they were created earlier. Also the series contains a small improvement around group0 creation.
No backport needed since this removes functionality.
Closesscylladb/scylladb#29482
* github.com:scylladb/scylladb:
db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
db/system_distributed_keyspace: remove unused code
db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
db/system_distributed_keyspace: drop old service_levels table
fix indent after the previous patch
group0: call setup_group0 only when needed
interval switched from std::optional<> to union + bools for bound
storage in 42d7ae1082.
Update the printer to work with the new layout. Keep the code
backwards compatible, 2025.1 still uses optionals and is still
supported.
Closesscylladb/scylladb#29738
Fix 4 genuine CQL syntax errors in documentation examples, found by automated extraction and execution of doc code blocks against a live ScyllaDB instance.
- **insert.rst**: `USING TTL 86400 IF NOT EXISTS` → `IF NOT EXISTS USING TTL 86400` (wrong clause order produces syntax error)
- **ddl.rst**: Missing opening quote in ALTER KEYSPACE example (`dc2'` → `'dc2'`)
- **ddl.rst**: Hyphenated column names need double-quoting; also fix PRIMARY KEY referencing non-existent `customer_id` instead of `cust_id`
- **types.rst**: UDT `address` contains nested collections, so it must be `frozen<address>` when used as a column type
Built a CQL extractor that parses `.. code-block:: cql` blocks from RST docs, then executed all 194 extracted statements against ScyllaDB 2026.2.0-rc0. These 4 are confirmed syntax/semantic errors in the documentation.
Closesscylladb/scylladb#29765
* github.com:scylladb/scylladb:
test/cqlpy: add tests for hyphenated column names
docs/cql: fix UDT example to use frozen<address>
docs/cql: fix CREATE TABLE example with hyphenated column names
docs/cql: fix missing opening quote in ALTER KEYSPACE example
docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
Prepare for truncate of tables on object storage, where we want to
limit the atomic deletion batches to produce smaller batch mutations.
This is safe since truncate does not really need to delete all sstables
in the table atomically — it is already non-atomic since each node and
each shard deletes its own sstables. The atomic deletion mechanism is
used for convenience.
Previously, discard_sstables collected all sstables from all compaction
groups on the shard into a single vector and issued one atomic delete
for all of them. Change to track removed sstables per compaction group
and issue separate atomic deletes per group using
coroutine::parallel_for_each, allowing concurrent deletion across
groups.
Closesscylladb/scylladb#29789
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.
The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
Fixes SCYLLADB-1885
Closesscylladb/scylladb#29798
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.
Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.
Fixes SCYLLADB-1756
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.
Three sites are changed:
1. wait_for_entry (truncation case): the snapshot-term match optimization
that proved the entry was committed now applies to both wait_type::committed
and wait_type::applied, not just committed.
2. wait_for_entry (snapshot covers entry): instead of throwing
commit_status_unknown when the snapshot index >= entry index, return
successfully. The entry's effects are included in the state machine's
state via the snapshot.
3. drop_waiters: when called from load_snapshot, pass the snapshot term.
Waiters whose term matches the snapshot term are resolved successfully
(set_value) instead of failing with commit_status_unknown, since the
Log Matching Property guarantees they were committed and included.
This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.
The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.
Fixes: SCYLLADB-1264
No backport: the changes are quite risky and the test being fixed
fails very rarely.
Closesscylladb/scylladb#29685
* github.com:scylladb/scylladb:
test/raft: fix duplicate check in connected::operator()
test/raft: add tests for add_entry snapshot interactions
raft: do not throw commit_status_unknown from add_entry when possible
raft: change drop_waiters parameter from index to snapshot descriptor
raft: server: fix a typo
Add `test_fulltext_index.py` covering the `fulltext_index` custom index:
- Creation on text, varchar, and ascii columns
- Rejection of non-text types (int, blob, vector)
- Validation of analyzer and positions options
- Rejection of unsupported option keys
- Case-insensitive class name lookup
- DESCRIBE INDEX output with and without options
- No backing materialized view in `system_schema.views`
- IF NOT EXISTS idempotent behavior
- Metadata correctness in `system_schema.indexes`
Move common description logic into a protected helper
`describe_with_target` on `custom_index`, so subclasses can delegate
to it when implementing the `describe()` virtual method.
Introduce `fulltext_index`, a new `custom_index` subclass
for full-text search (FTS).
The index validates that the target column is a text type
(text, varchar, or ascii) and supports two WITH OPTIONS keys:
- 'analyzer': one of standard, english, german, french, spanish,
italian, portuguese, russian, chinese, japanese, korean, simple,
whitespace
- 'positions': boolean controlling whether term positions are stored
`view_should_exist()` returns false — no backing materialized view is
created, matching the CDC-backed pattern used by `vector_index`.
Fixes: SCYLLADB-1517
Move `validate_enumerated_option`, `validate_positive_option`,
and `validate_factor_option` into shared index option utilities
under the `secondary_index::util` namespace. These functions were
previously defined as file-local statics in `vector_index.cc` with
hardcoded index names in error messages.
The shared versions take `index_type_name` as a parameter, allowing
each `custom_index` subclass to pass its own name via the virtual
`index_type_name()` method at the call site. The options maps use
`std::bind_front` to bind config params (supported values, limits),
leaving `index_name` as the first unbound argument passed by
`check_index_options()`.
Add `index_type_name()` as a pure virtual method on `custom_index`.
Move the shared utility implementations into `index_option_utils.cc`
and update `vector_index.cc` to use them.
The operator had a copy-paste bug: it checked
disconnected.contains({id1, id2}) twice instead of checking both
directions ({id1, id2} and {id2, id1}).
Reduce the operator to a single directional check: {id1, id2}. It works
for all current callers, and checking both directions correctly would
break the new block_receive() function.
Add six tests covering add_entry with wait_type::applied and
wait_type::committed for three snapshot scenarios affected in the
previous commit:
1. Snapshot at the entry's index (wait_for_entry, term_for returns
snapshot term).
2. Snapshot past the entry's index (wait_for_entry, term_for returns
nullopt).
3. Follower's waiter is resolved via drop_waiters when a snapshot
is loaded.
Without the fix in the previous commit, 4 of 6 tests fail:
all 3 wait_type::applied tests and the wait_type::committed
drop_waiters test. The remaining two tests pass because the changes
don't affect them.
We don't write tests covering the scenarios when add_entry should
still throw commit_status_unknown (that is when the entry's term
doesn't match the snapshot's term) because:
- these tests would be very complicated,
- a bug that would make these tests fail should also make the
nemesis tests fail, as there would be an issue with linearizability.
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.
Three sites are changed:
1. wait_for_entry (truncation case): the snapshot-term match optimization
that proved the entry was committed now applies to both wait_type::committed
and wait_type::applied, not just committed.
2. wait_for_entry (snapshot covers entry): instead of throwing
commit_status_unknown when the snapshot index >= entry index, return
successfully. The entry's effects are included in the state machine's
state via the snapshot.
3. drop_waiters: when called from load_snapshot, pass the snapshot term.
Waiters whose term matches the snapshot term are resolved successfully
(set_value) instead of failing with commit_status_unknown, since the
Log Matching Property guarantees they were committed and included.
This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.
The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.
Fixes: SCYLLADB-1264
Change drop_waiters(std::optional<index_t> idx) to
drop_waiters(const snapshot_descriptor* snp). The only caller that passes
an index is load_snapshot, which already has the full snapshot descriptor.
Using it directly makes the parameter self-documenting and prepares for the
following commit which will also need the snapshot term (a field of
snapshot_descriptor).
Several sstable_compaction_test cases run prohibitively slowly on S3 and GCS backends — some taking 4+ minutes — because they create hundreds of SSTables sequentially over high-latency HTTP connections and perform redundant validation (checksumming) round-trips on every one. The twcs_reshape_with_disjoint_set S3 variant was even disabled entirely because of this.
The changes apply three complementary optimizations, per-test:
**Skip SSTable validation on remote storage.** The compaction tests verify strategy logic, not data integrity. SSTable validation triggers additional read-back I/O which is cheap on local disk but expensive over HTTP. A `do_validate` flag now conditionally skips validation when the storage backend is not local.
**Parallelize SSTable creation with async coroutines.** A new `make_sstable_containing_async` coroutine overload is added alongside the existing synchronous `make_sstable_containing`. Sequential creation loops are replaced with `parallel_for_each` using coroutine lambdas that call the async overload directly, overlapping S3/GCS uploads without spawning a dedicated Seastar thread per SSTable. The async validation path performs the same content checks as the synchronous version (mutation merging and `is_equal_to_compacted` assertions). Operations that depend on the created SSTables (e.g. `add_sstable_and_update_cache`, `owned_token_ranges` population) remain sequential.
**Reduce SSTable count for remote variants.** Tests like twcs_reshape_with_disjoint_set and stcs_reshape_overlapping used a hardcoded count of 256. The count is now a function parameter (default 256 for local, 64 for S3/GCS), which is sufficient to exercise the compaction strategy logic while avoiding excessive remote I/O.
Infrastructure changes: S3 endpoint max_connections raised from the default to 32 to support the higher upload concurrency, and trace-level logging added for s3, gcp_storage, http, and default_http_retry_strategy to aid future debugging.
The previously disabled twcs_reshape_with_disjoint_set_s3_test is re-enabled with these optimizations.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1428
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1843
No backport needed — this is a test-only performance improvement.
Closesscylladb/scylladb#29416
* github.com:scylladb/scylladb:
test: optimize compaction_strategy_cleanup_method for remote storage
test: optimize stcs_reshape_overlapping for remote storage
test: optimize twcs_reshape_with_disjoint_set for remote storage
test: parallelize SSTable creation in cleanup_during_offstrategy_incremental
test: parallelize SSTable creation in run_incremental_compaction_test
test: parallelize SSTable creation in offstrategy_sstable_compaction
test: parallelize SSTable creation in twcs_partition_estimate
test: add trace-level logging for S3 and HTTP in compaction tests
test: make sstable test utilities natively async The original make_memtable used seastar::thread::yield() for preemption, which required all callers to run inside a seastar::thread context. This prevented the utilities from being used directly in coroutines or parallel_for_each lambdas. Make the primary functions — make_memtable, make_sstable_containing, and verify_mutation — return future<> directly. Callers now .get() explicitly when in seastar::thread context, or co_await when in a coroutine. make_memtable now uses coroutine::maybe_yield() instead of seastar::thread::yield(). verify_mutation is converted to coroutines as well. Requested in: https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
test: move make_memtable out of external_updater in row_cache_test
test: increase S3 max connections for compaction tests
The variable was constructed but never used — the original iterator
is returned instead. Fix the misleading comment to explain the
open-shard semantics of returning the original iterator.
Both disable_stream and wait_for_active_stream used time.process_time()
for their timeouts, but process_time measures CPU time, not wall-clock
time. Since these loops spend most of their time sleeping and waiting on
API calls, the timeouts could last far longer than intended. Use
time.time() instead to enforce actual wall-clock deadlines.
Previously, disabling Alternator Streams would create a blank
cdc::options with only enabled=false, which meant losing access also
to stored Streams's data (including preimage and postimage).
Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to
false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set,
indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN
and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so
that disabled streams whose log table still exists are listed.
When a stream is re-enabled on an Alternator table that has an existing
(disabled) CDC log table, the old log table is dropped and a fresh one
is created with a new UUID, producing a new StreamArn. This is
Alternator-specific behavior; CQL CDC tables continue to reuse the
existing log table.
The old stream data is lost immediately upon re-enable. DynamoDB keeps
it readable for 24 hours.
Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail
now that disabled streams are usable.
- test_streams_reenable: new test verifying that re-enabling produces
a new ARN and the old data is still readable via the old ARN (xfail
because Scylla currently purges old data on re-enable).
Fixesscylladb/scylladb#7239
Add a Boost unit test that forces capacity-based balancing through
configuration and verifies that a drained and excluded node will be
drained of its tablets when tablet size stats are missing.
The test covers the regression where the allocator rejected the plan due
to incomplete tablet stats, even though forced capacity-based balancing
does not depend on tablet sizes.
When force_capacity_based_balancing is enabled, the tablet allocator
balances by node and shard capacity rather than by tablet sizes.
When the data needed for load balancing is incomplete, the balancer
fails and waits until load_stats is available and correct for all the
nodes. An exception to this is when a node is being drained and
excluded: it is unreachable, and will not return. In this case
the balancer has to do its best and ignore the missing data.
This patch fixes a bug where forcing capacity based balancing made the
balancer not ignore missing data in these cases, and instead abort the
balancing.
In the test_delete_partition_rows_from_table_with_mv case we perform
a deletion of a large partition to verify that the deletion will
self-throttle when generating many view updates.
Before the deletion, we first build the materialized view, which causes
the view update backlog to grow. The backlog should be back to empty
when the view building finishes, and we do wait for that to happen, but
the information about the backlog drop may not be propagated to the
delete coordinator in time - the gossip interval is 1s and we perform
no other writes between the nodes in the meantime, so we don't make use
of the "piggyback" mechanism of propagating view backlog either. If the
coordinator thinks that the backlog is high on the replica, it may reject
the delete, failing this test.
We change this in this patch - after the view is built, we perform an
extra write from the coordinator. When the write finishes, the coordinator
will have the up-to-date view backlog and can proceed with the DELETE.
Additionally, we enable the "update_backlog_immediately" injection, which
makes the node backlog (the highest backlog across shards) update immediately
after each change.
Fixes: SCYLLADB-1795
Closesscylladb/scylladb#29775
The test used a real-time sleep to move the queued permit into the
preemptive-abort window. If the reactor did not get CPU for long
enough, admission could run only after the permit's timeout had
expired, making the expected abort path flaky.
The test also exhausted memory together with count resources, so the
queued permit could wait for memory. Preemptive abort is intentionally
not applied to permits waiting for memory, so keep enough memory
available and assert that the permit is queued only on count.
Use an immediate preemptive-abort threshold and a long finite timeout
to exercise admission-time abort without relying on scheduler timing.
Fixes: SCYLLADB-1796
Closesscylladb/scylladb#29736
CreateTable and UpdateTable call wait_for_schema_agreement() after announcing the schema change, to ensure all live nodes have applied the new schema before returning to the user. This wait has a hard- coded 10 second timeout, and on some overloaded test machines we saw it not completing in time, and causing tests to become flaky.
This patch increases this timeout from 10 seconds to 30 seconds. It's still hard-coded and not configurable via alternator_timeout_in_ms because it is unlikely any user will want to change it - it just needs to be long.
The patch also improves the behavior of a schema-agreement timeout, when it happens:
1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the operation is unknown; So the user will repeat the CreateTable, and will get a ResourceInUseException because the table exists. In that case too, we need to wait for schema agreement. So we added this missing wait.
Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)
This patch is only important to improve test stability in extremely slow test machines where schema agreement sometimes (very rarely) takes over 10 seconds. It's not important to backport it to branches that don't run CI very often on slow machines.
Closesscylladb/scylladb#29744
* https://github.com/scylladb/scylladb:
alternator: improve CreateTable/UpdateTable schema agreement timeout
migration_manager: unique timeout exception for wait_for_schema_agreement()
In debug mode, this test can timeout during tablets merge. While the
test already decreases the number of tables in debug mode (20 tables,
instead of 200 for dev mode), this is not enough, and the test can still
timeout during merge. This change reduces the number of tables from 20
to 5 in debug mode.
It also drops the log level for lead_balancer to debug. This should make
any potential future problems with this test easier to investigate.
Fixes: SCYLLADB-1717
Closesscylladb/scylladb#29682
Replace the naive host.is_up check with wait_for_cql_and_get_hosts() which
actually executes a query against each host, ensuring the driver's connection
pool is fully re-established before proceeding to stop the last server.
The is_up flag is set asynchronously via gossip and doesn't guarantee the
connection pool has live TCP connections. After a server restart, the flag
may be True while the pool still holds stale connections. When the pool
monitor later discovers them dead it briefly marks the host DOWN, causing
NoHostAvailable if another server is being stopped concurrently.
Fixes SCYLLADB-1840
Closesscylladb/scylladb#29769
This PR includes two changes that make gossiper much less likely to mark
nodes as down in tests unexpectedly, and cause test flakiness in issues
like SCYLLADB-864:
- fixing false node conviction when echo succeeds,
- increasing the failure_detector_timeout fixture.
Fixes: SCYLLADB-864
No need for backport: related CI failures are rare, and merging #29522
made them even more unlikely (I haven't seen one since then, but it's
still possible to reproduce locally on dev machines).
Closesscylladb/scylladb#29755
* github.com:scylladb/scylladb:
test/cluster: increase failure_detector_timeout
gossiper: fix false node conviction when echo succeeds
The `vector_store_client_test_dns_resolving_repeated` test was intermittently
timing out on CI. The exact root cause is not fully understood, but the
hypothesis is that a single trigger signal can be lost somewhere (not exactly
known where). This is not an issue for the production code because refresh
trigger will be called multiple times whenever all configured nodes will be
unreachable.
Fixes SCYLLADB-1794
Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches.
Closesscylladb/scylladb#29752
* github.com:scylladb/scylladb:
vector_search: test: default timeout in test_dns_resolving_repeated
vector_search: test: fix flaky test_dns_resolving_repeated
Fixes: SCYLLADB-1815
If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation.
As for the test:
Checking segment sizes should not use a size filter that rounds (up) sizes.
More importantly, the estimate for what is acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure.
Closesscylladb/scylladb#29753
* github.com:scylladb/scylladb:
commitlog_test.py: Fix size check aliasing, and threshold calc.
commitlog: Fix segment/chunk overhead maybe not included in next_position calculation
The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE.
When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its element's cells.
As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception.
The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816
Backport: all supported versions 2026.1, 2025.4, 2025.1
Closesscylladb/scylladb#29757
* github.com:scylladb/scylladb:
test: add reproducer for auth cache crash on missing permissions column
auth: tolerate missing permissions column in authorize()
auth: add defensive has() guard for role_attributes value column
auth: remove unused permissions field from cache role_record
Commit cf237e060a introduced 'from test.pylib.driver_utils import
safe_driver_shutdown' in pgo/exec_cql.py. This module runs during PGO
profile training (a build step) where the test package is not on the
Python path, causing an immediate ModuleNotFoundError on both x86 and
ARM. Revert to plain cluster.shutdown() which is sufficient for the
single-use PGO training scenario.
Fixes: SCYLLADB-1792
Closesscylladb/scylladb#29746
Verify that double-quoted column names with hyphens (e.g. "my-col")
work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that
unquoted hyphenated names are rejected with a syntax error.
The 'address' UDT contains a nested collection (map<text, frozen<phone>>),
so it must be frozen when used as a column type. Non-frozen UDTs with
nested non-frozen collections are not supported.
Column names containing hyphens must be double-quoted. Also fix
the PRIMARY KEY reference from 'customer_id' (non-existent) to
'cust_id' (the actual column).