For tablet_virtual_task, set creation_time to tablet_task_info::request_time
and start_time to tablet_task_info::sched_time, matching the actual semantics
of when the request was created vs when it was scheduled for execution.
Co-authored-by: Deexie <56607372+Deexie@users.noreply.github.com>
Combined the two separate topology_request_tracking_mutation_builder calls
into one, setting both start_time and done status in the same mutation builder
to reduce redundancy.
Co-authored-by: Deexie <56607372+Deexie@users.noreply.github.com>
Refactored start_time setting for global requests to be included in the same
mutation batch that starts the actual operation, matching the pattern used for
node operations. This avoids an extra update_topology_state call and ensures
start_time is set atomically with the operation start.
Also updated nodetool tasks documentation to include creation_time field in
example outputs for status, list, and tree commands.
Co-authored-by: Deexie <56607372+Deexie@users.noreply.github.com>
Extended scylla-nodetool.cc to display creation_time in all task-related outputs:
- tasks_print_status: Added creation_time to time field formatting
- tasks_print_trees: Added creation_time column to task tree display
- tasks_print_stats_list: Added creation_time column to task stats list
Co-authored-by: Deexie <56607372+Deexie@users.noreply.github.com>
This reverts commit ff1b212319. In this
commit, the python driver was updated to 3.29.6. That version has a
serious flaw - it rejects compression=None settings [1] which
cqlsh (legitimately) uses in copyutil.py.
The reason this hasn't caused numerous continuous integration failures
is that the submodule update commit did not update the frozen toolchain,
so the build was effectively running with an older version of the driver.
Fix by reverting the change. This allows us to regenerate the frozen
toolchain when we need to.
Reverted changes:
* tools/cqlsh 2240122...6badc99 (2):
> Update scylla-driver version to 3.29.6
> Revert "Migrate workflows to Blacksmith"
[1] 78f554236fClosesscylladb/scylladb#27473
This is an optimization follow-up [for this PR](https://github.com/scylladb/scylladb/pull/27396#issuecomment-3611410774): avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem.
Before this PR, a `std::vector<put_or_delete_item>` could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead.
backport: not needed, this is an optimization
Closesscylladb/scylladb#27432
* github.com:scylladb/scylladb:
alternator/executor.cc: avoid cross-shard free
storage_proxy: cas: take cas_request by raw reference
With python 3.14, the Process fails due to pickling issue with nodes objects.
This will eliminate this issue, so we can bump up the python version.
Closesscylladb/scylladb#27456
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.
Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.
The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s untill all storage_proxy::cas
calls are done.
In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.
This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:
lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;
Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.
This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.
Fixes#25799Closesscylladb/scylladb#27470
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries. In some
workloads that could be many entries, even the whole cache. Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.
This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.
perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):
Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0
Fixes#23479Fixes#2578Closesscylladb/scylladb#27469
* github.com:scylladb/scylladb:
cache, mvcc: Preempt cache update when applying range tombstone from memtable
partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()
perf-row-cache-update: Add scenario with large tombstone covering many rows
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.
Closesscylladb/scylladb#27387
* github.com:scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.
So no system should have the old schema any more.
No need to run legacy_schema_migrator on boot.
Closesscylladb/scylladb#27420
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries. In some
workloads that could be many entries, even the whole cache. Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.
This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.
perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):
Generated 1124195 rows
Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
Draining...
took 0.000616 [ms]
cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0
Fixes#23479Fixes#2578
Fills memtable with rows and a tombstone which deletes all rows which
are already in cache.
Similar to raft log workload, but more extreme.
With -c1 -m4G, observed really bad performance:
update: 1711.976196 [ms], preemption: {count: 22603, 99%: 0.943127 [ms], max: 1494.571776 [ms]}, cache: 2148/2906 [MB], alloc/comp: 1334/869 [MB] (amp: 0.651), pr/me/dr 1062186/0/1062187
cache: 2148/2906 [MB], memtable: 738/1024 [MB], alloc/comp: 993/0 [MB] (amp: 0.000)
Which means that max reactor stall during cache update was 1.5 [s]
0.7 GB memtables. 2.1 GB in cache.
The DynamoDB API's "BatchWriteItem" operation is spelled like this, in
singular. Some comments incorrectly referred to as BatchWriteItems - in
plural. This patch fixes those mistakes.
There are no functional changes here or changes to user-facing documents -
these mistakes were only in code comments.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27446
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
Several test cases where introduced to verify expected behaviour.
Backport is not required, it is a new feature
Fixes#20100Closesscylladb/scylladb#27287
* github.com:scylladb/scylladb:
sstable_test: add verification testcases of SSTable components digests persistance
sstables: store digest of all sstable components in scylla metadata
sstables: Add TemporaryScylla metadata component type
sstables: Extract file writer closing logic into separate methods
sstables: Add components_digests to scylla metadata components
sstables: Implement CRC32 digest-only writer
Fixes#27367Fixes#27362Fixes#27366
Makes http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set
Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere.
Closesscylladb/scylladb#27368
* github.com:scylladb/scylladb:
ear::kms/ear::azure: Use utils::http URL parsing
ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
utils::http: Handle ipv6 numeric host part in URL:s
It is observed that:
repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])
This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.
To fix, we can relax the check of (start, end] for the min max range.
Fixes#27220Closesscylladb/scylladb#27357
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces was, by
accident, required for vector indexes, too.
Remove the restriction for vector store indexes as it is completely
unnecessary.
Fixes: SCYLLADB-81
Closesscylladb/scylladb#27447
During b9199e8b24
reivew it was suggested to use standard for loop
but when erasing element it causes increment on
invalid iterator, as role could have been erased
before.
This change brings back original code.
Fixes: https://github.com/scylladb/scylladb/issues/27422
Backport: no, offending commit not released yet
Closesscylladb/scylladb#27444
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
After 39cec4ae45 node join may fail with either "request canceled" notification or (very rarely) because it was banned. Depend on timing. The series fixes the test to check for both possibilities.
Fixes#27320
No need to backport since the flakiness is in the mater only.
Closesscylladb/scylladb#27408
* https://github.com/scylladb/scylladb:
test: fix test_coordinator_queue_management flakiness
test/pylib: allow expected_error in server_start to contain regular expression
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.
Additionally, `test_rest_api_on_startup` is added to reproduce the problem.
Fixes: https://github.com/scylladb/scylladb/issues/27130
No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start.
Closesscylladb/scylladb#27410
* github.com:scylladb/scylladb:
test: add test_rest_api_on_startup
main: delay setup of storage_service REST API
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:
```c++
// [1, 2, 3, 0] --> [0, 1, 3, 6]
std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```
However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.
As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.
An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57
The fix is simple: the initial value is passed as uint64_t instead of int.
Fixes https://github.com/scylladb/scylladb/issues/27417Closesscylladb/scylladb#27418
Fixes#27362
The KMIP host connector should handle ipv4 connections (named or numeric).
It also should fall back to system trust when truststore is not specified.
Fixes#27366
A URL with numeric host part formats special in case of ipv6,
to avoid confusion with port part.
The parser should handle this.
I.e.
http://[2001:db8:4006:812::200e]:8080
v2:
* Include scheme agnostic parse + case insensitive scheme matching
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE
Fixes https://github.com/scylladb/scylladb/issues/27070Closesscylladb/scylladb#27097
* github.com:scylladb/scylladb:
mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
cql: add CONCURRENCY to the USING clause
Currently, _flush_time was stored as a std::optional<gc_clock::time_point>
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected since the _flush_time is initialized
with value (not optional).
Change _flush_time type to gc_clock::time_point. If a flush is needed but failed,
get_flush_time() throws an exception.
This was suppose to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.
Refs: https://github.com/scylladb/scylladb/issues/24415.
Closesscylladb/scylladb#26794
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shard instead, so that the handler itself does not need
to jump to shard 0.
p11-kit has hardcoded paths for the trust paths. Of course, each
Linux distribution hardcodes those paths differently. As a result,
our relocatable gnutls, which uses p11-kit-trust.so to process the
trust paths, needs some overrides to select the right paths.
Currently, we use p11_kit_override_system_files(), a p11-kit API
intended for testing, but which worked well enough for our purpose,
to override the trust module configuration.
Unfortunately, starting (presumably [1]) in gnutls 3.8.11, gnutls
changed how it works with p11-kit and our override is now ignored.
This was likely unintentional, but there appears to be a better way:
instead of letting gnutls auto-load the trust module from a hacked
configuration, we load the modules outselves using
gnutls_pkcs11_init(GNUTLS_PKCS11_FLAG_MANUAL) and
gnutls_pkcs11_add_provider(). These appear to be intended for the purpose.
We communicate the paths to the scylla executable using an environment
variable. This isn't optimal, but is much easier than adding a command
line variable since there are multiple levels of command line parsing due
to the subtool mechanism.
With this, we unlock the possibility to upgrade gnutls to newer versions.
[1] aa5f15a872Closesscylladb/scylladb#27348
After 39cec4ae45 node join may fail with either "request canceled"
notification or (very rarely) because it was banned. Depend on timing.
The patch fixes the test to check for both possibilities.
Fixes#27242
Similar to AWS, google services may at times simply return a 503,
more or less meaning "busy, please retry". We rely for most cases
higher up layers to handle said retry, but we cannot fully do so,
because both we reach this code sometimes through paths that do
no such thing, and also because it would be slightly inefficient,
since we'd like to for example control the back-off for auth etc.
This simply changes the existing retry loop in gcp_host to
be a little more forgiving, special case 503 errors and extend
the retry to the auth part, as well as re-use the
exponential_backoff_retry primitive.
v2:
* Avoid backoff if refreshing credentials. Should not add latency due to this.
* Only allow re-auth once per (non-service-failure-backoff) try.
* Add abort source to both request and retry
v3:
* Include timeout and other server errors in retry-backoff
v4:
* Reorder error code handling correctly
Closesscylladb/scylladb#27267
This commit removes the now redundant driver pages from
the Scylla DB documentation. Instead, the link to the pages
where we moved the diver information is added.
Also, the links are updated across the ScyllaDB manual.
Redirections are added for all the removed pages.
Fixes https://github.com/scylladb/scylladb/issues/26871Closesscylladb/scylladb#27277
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
It's a pre existing issue. Backport is required to all recent 2025.x versions.
Closesscylladb/scylladb#27413
* github.com:scylladb/scylladb:
topology_coordinator: Fix the indentation for the cleanup_target case
topology_coordinator: Add barrier to cleanup_target
test_node_failure_during_tablet_migration: Increase RF from 2 to 3
Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files
and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with
the statistics component.
This patch adds separate group for vector search parameters in the
documentation and fixes small typos and formatting.
Fixes: SCYLLADB-77.
Closesscylladb/scylladb#27385
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.
This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.
This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.
Closesscylladb/scylladb#27388
Scylla uses zlib, through the header <zlib.h>, in sstable compression.
We also want to use it in Alternator for gzip-compressed requests.
We never actually required zlib explicltly in install-dependencies.sh,
we only get it through transitive dependencies. But it's better to
require it explicitly so this is what we do in this patch.
In Fedora, we use the newer, more efficient, zlib-ng which is API-
compatible with the classic zlib. Unfortunately, the Debian zlib-ng
package is *not* drop-in compatible with zlib (you need to include
a different header file <zlib-ng.h>) so we use the classic zlib.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27238
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.
To address it, the third rack with a single node is added and the
replication factor is increased to 3.
This test verifies that REST API requests are handled properly
when a server is started or restarted. It is used to verify
the fix for scylladb/scylladb#27130, where a server failed with a
segmentation fault when `storage_service/raft_topology/reload` was
called too early.
Refs: scylladb/scylladb#27130
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.
Fixes: scylladb/scylladb#27130
Creating endpoint conf can be made with the s3_server method
Getting boto3 resource from s3_server itself is also possible
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27380
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.
To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.
Fixesscylladb/scylladb#27398Closesscylladb/scylladb#27401
The metrics extension now includes validation to detect missing metrics. This validation caused failures during multiversion publication because older versions did not generate all required properties.
Instead of fixing each branch, a strict mode flag was introduced to control when validation should run.
Strict mode is enabled in the workflow that validates pull requests, ensuring that new changes meet the expected metrics.
During multiversion builds, validation errors are now logged but do not raise exceptions, which prevents build failures while still providing visibility into missing data.
docs: verbose mode
docs: verbose mode
Closesscylladb/scylladb#27402
Since Alternator is now using tablets by default, it's no longer possible
to create an Alternator table on a 3-node cluster with a single rack -
you need to have 3 racks to support RF=3.
Most of the multi-node Alternator tests in test/cluster/test_alternator.py
were already fixed to use a 3-rack cluster, but one test was missed
because it was marked "xfail" so its new failure to create the table was
missed. This patch adds the missing 3-rack setup, so the xfailing test
returns to failing on the real bug - not on the table creation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27382
Even though that `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.
Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneus view building coordinator behavior - does not print a warning.
Fixes: scylladb/scylladb#27363Closesscylladb/scylladb#27373
The motivation for the update is using newer version of scylla-driver
that supports new event type CLIENT_ROUTES_CHANGE.
* tools/cqlsh 22401228...6badc992 (2):
> Update scylla-driver version to 3.29.6
> Revert "Migrate workflows to Blacksmith"
Closesscylladb/scylladb#27359
Fixes#27268
Refs #27268
Includes the auth call in code covered by backoff-retry on server
error, as well as moves the code to use the shared primitive for this
and increase the resilience a bit (increase retry count).
v2:
* Don't do backoff if we need to refresh credentials.
* Use abort source for backoff if avail
v3:
* Include other retryable conditions in auth check
Closesscylladb/scylladb#27269
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.
This reverts commit b0643f8959, reversing
changes made to e8b0f8faa9.
The change forgot to update
sstables_manager::get_highest_supported_format(), which results in
/system/highest_supported_sstable_version still returning me, confusing
and breaking tests.
Fixes: scylladb/scylla-dtest#6435Closesscylladb/scylladb#27379
The root cause for the hanging test is a concurrency deadlock.
`vector_store_client` runs dns refresh time and it is waiting for the condition
variable.After aborting dns request the test signals the condition variable.
Stopping the vector_store_client takes time enough to trigger the next dns
refresh - and this time the condition variable won't be signalled - so
vector_store_client will wait forever for finish dns refresh fiber.
The commit fixes the problem by waiting for the condition variable only once.
Fixes: #27237
Fixes: VECTOR-370
Closesscylladb/scylladb#27239
This PR introduces two key improvements to the robustness and resource management of vector search:
Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out
, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.
Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.
Fixes: SCYLLADB-76
This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.
Closesscylladb/scylladb#27377
* github.com:scylladb/scylladb:
vector_search: Fix ANN query abort on CQL timeout
vector_search: Reduce connection and keep-alive timeouts
Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request.
Fixes: https://github.com/scylladb/scylladb/issues/27349Closesscylladb/scylladb#27350
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out
preventing unnecessary resource consumption.
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.
This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
set worksteal disribution for xdist(new sheduler)
Because now it shows better tests distribution that standart(load) in CI
Closesscylladb/scylladb#27354
Add comprehensive coding guidelines for GitHub Copilot to improve
quality and consistency of AI-generated code. Instructions cover C++
and Python development with language-specific best practices, build
system usage, and testing workflows.
Following GitHub Copilot's standard layout with general instructions
in .github/copilot-instructions.md and language-specific files in
.github/instructions/ directory using *.instructions.md naming.
No backport: This change is only for developers in master, so it doesn't
need to be backported.
Closesscylladb/scylladb#25374
In this series we implement Alternator's support for gzip-compressed
requests, i.e., requests with the "Content-Encoding: gzip" header,
other uncompressed header, and a gzip-compressed body.
The server needs to verify the signature of the *compressed* content,
and then uncompress the body before running the request.
We only support gzip compression because this is what DynamoDB supports.
But in the future we can easily add support for other compression
algorithms like lz4 or zstd.
This series Refs #5041 but doesn't "Fixes" it because it only implements
compressed requests (Content-Encoding), *not* compressed responses
(Accept-Encoding).
In addition to the code changes, the series also contains tests for this
feature that make sure it behaves like DynamoDB.
Note that while we will have now support in our server for compressed
requests, just like DynamoDB does, the clients (AWS SDKs) will probably
NOT make use of it because they do not enable request compression by
default. For example, see the tests for some hoops one needs to jump
through in boto3 (the Python SDK) to send compressed requests. However,
we are hoping that in the future Alternator's modified clients will
use compressed requests and enjoy this feature.
Closesscylladb/scylladb#27080
* github.com:scylladb/scylladb:
test/alternator: enable, and add, tests for gzip'ed requests
alternator: implement gzip-compressed requests
seastar::compat::source_location (which should not have been used
outside Seastar) is replaced with std::source_location to avoid
deprecation warnings. The relevant header, which was removed, is no
longer included.
* seastar 8c3fba7a...b5c76d6b (3):
> testing: There can be only one memory_data_sink
> util: Use std::source_location directly
> Merge 'net: support proxy protocol v2' from Avi Kivity
apps: httpd: add --load-balancing-algorithm
apps: httpd: add /shard endpoint
test: socket_test: add proxy protocol v2 test suite
test: socket_test: test load balancer with proxy protocol
net: posix_connected_socket: specialize for proxied connections
net: posix_server_socket_impl: implement proxy protocol in server sockets
net: posix_server_socket_impl: adjust indentation
net: posix_server_socket_impl: avoid immediately-invoked lambda
net: conntrack: complete handle nested class special member functions
net: posix_server_socket_impl: coroutinize accept()
Closesscylladb/scylladb#27316
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.
To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)
No backport is necessary.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918
Closes scylladb/scylladb#26121
* github.com:scylladb/scylladb:
test/alternator: Enable the tests failing because of #6918
alternator, cdc: Don't emit events for no-op removes
alternator, cdc: Don't emit an event for equal items
alternator/streams, cdc: Differentiate item replace and item update in CDC
alternator: Change the return type of rmw_operation_return
config: Add alternator_streams_strict_compatibility flag
cdc: Don't split a row marker away from row cells
Consider this:
1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 an n4.
3) n3 and n4 take and store the in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hangs since the lock is
already taken.
To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.
Fixes#26346Closesscylladb/scylladb#27163
Fix unlikely use-after-free in `encode_paging_state`. The function
incorrectly assumes that current position to encode will always have
data for all clustering columns the schema defines. It's possible to
encounter current position having less than all columns specified, for
eample in case of range tombstone. Those don't happen in Alternator
tables as DynamoDB doesn't allow range deletions and clustering key
might be of size at most 1. Alternator api can be used to read
scylla system tables and those do have range tombstones with more
than single clustering column.
The fix is to stop trying to encode columns, that don't have the value -
they are not needed anyway, as there's no possible position with those
values (range tombstone made sure of that).
Fixes#27001Fixes#27125Closesscylladb/scylladb#26960
Previously, `wait_until_driver_service_level_created` only waited for
the `driver` service level to appear in the output of
`LIST ALL SERVICE_LEVELS`. However, the fact that one node lists
`sl:driver` does not necessarily mean that all other nodes can see
it yet. This caused sporadic test failures, especially in DEBUG builds.
To prevent these failures, this change adds an extra wait for
a `raft/read_barrier` after the `driver` service level first appears.
This ensures the service level is globally visible across the cluster.
Fixes: https://github.com/scylladb/scylladb/issues/27019
Na backport - test fix for `sl:driver` tests, and this that is only available on `master`
Closesscylladb/scylladb#27076
* github.com:scylladb/scylladb:
test: wait for read_barrier in wait_until_driver_service_level_created
test: use ManagerClient in wait_until_driver_service_level_created
Before this commit, any attempt to create, alter, attach, or drop
the default service level would result in a syntax error whose
error message was unclear:
```
cqlsh> attach service level default to cassandra;
SyntaxException: line 1:21 no viable alternative at input 'default'
```
The error stems from the grammar not being able to parse `default`
as a correct service level name. To fix that, we cover it manually.
This way, the grammar accepts it and we can process it in Scylla.
The reason why we'd like to cover the default service level is that
it's an actual service level that the user should reference. Getting
a syntax error is not what should happen. Hence this fix.
We validate the input and if the given role is really the default
service level, we reject the query and provide an informative error
message.
Two validation tests are provided.
Fixesscylladb/scylladb#26699Closesscylladb/scylladb#27162
When we come across a segment truncation, this information may
be helpful to determine when the error occurred exactly and
hint at what code path might've led to it.
Closesscylladb/scylladb#27207
Fixes#24346
When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.
This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.
Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).
v2:
* Added unit test
Closesscylladb/scylladb#27236
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.
For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"
This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.
Fixes: scylladb/scylladb#27255
No backport: The problem was only seen in tests and not reported in
customer tickets, so it's enough to fix it in the main branch.
Closesscylladb/scylladb#27314
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixesscylladb/scylladb#27304Closesscylladb/scylladb#27312
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.
The patch is
for f in $(git grep -l 'seastar::compat::source_location'); do
sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
done
and removal of few header includes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27309
`git log --format` doesn't add a newline after the last line. This
causes `read` to ignore that line, losing the last line (corresponding
to the first commit).
Use `git log --tformat` instead, which terminates the last line.
Closesscylladb/scylladb#27317
The reason for this seastar update is to have the fixed handling
of the `integer` type in `seastar-json2code` because it's needed
for further development of ScyllaDB REST API.
The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
- Remove `seastar/util/modules.hh` includes as the file was removed
from seastar
- Modified `metrics::impl::labels_type` construction in
`test/boost/group0_test.cc` because now it requires `escaped_string`
* seastar 340e14a7...8c3fba7a (32):
> Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
dns: Optimize packet sending for newer c-ares versions
dns: Replace net::packet with vector<temporary_buffer>
dns: Remove unused local variable
dns: Remove pointless for () loop wrapping
dns: Introduce do_sendv_tcp() method
dns: Introduce do_send_udp() method
> test: Add http rules test of matching order
> Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
memcached: Patch test to use memory_data_source
memcached: Use memory_data_source in server
rpc: Use memory_data_sink without constructing net::packet
util: Generalize packet_data_source into memory_data_source
> tests: coroutines: restore "explicit this" tests
> reactor: remove blocking of SIGILL
> Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
github: Use gcc-14
github: Use clang-20
> Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
test: Make test_resolve() try several addresses
test: Coroutinize test_resolve() helper
> modules: make module support standards-compliant
> Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
dns: Squash two if blocks together
dns: Do not check tcp entry for udp type
> coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
> promise: Document that promise is resolved at most once
> coroutine: exception: workaround broken destroy coroutine handle in await_suspend
> socket: Return unspecified socket_address for unconnected socket
> smp: Fix exception safety of invoke_on_... internal copying
> Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
reactor: Keep loads timer on reactor
reactor: Update loads evaluation loop
> Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
test: extend tests/unit/api.json to use 'integer' type
scripts: add 'integer' type to seastar-json2code
> Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
tls: Rename do_put_one(temporary_buffer) into do_put()
tls: Fix indentation after previous patch
tls: Move semaphore grab into iterating do_put()
> net: tcp: change unsent queue from packets to temporary_buffer:s
> timer: Enable highres timer based on next timeout value
> rpc: Add a new constructor in closed_error to accept string argument
> memcache: Implement own data sink for responses
> Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
file: do_recursive_remove_directory(): move object when popping from queue
file: do_recursive_remove_directory(): adjust indentation
file: do_recursive_remove_directory(): coroutinize
file: do_recursive_remove_directory(): simplify conditional
file: do_recursive_remove_directory(): remove wrong const
file: do_recursive_remove_directory(): clean up work_entry
> tests: Move thread_context_switch_test into perf/
> test: Add unit test for append_challenged_posix_file
> Merge 'Prometheus metrics handler optimization' from Travis Downs
prometheus: optimize metrics aggregation
prometheus: move and test aggregate_by helper
prometheus: various optimizations
metrics: introduce escaped_string for label values
metric:value: implement + in terms of +=
tests: add prometheus text format acceptance tests
extract memory_data_sink.hh
metrics_perf: enhance metrics bench
> demos: Simplify udp_zero_copy_demo's way of preparing the packet
> metrics: Remove deprecated make_...-ers
> Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
test: Use BOOST_REQUIRE checkers
test: Replace some SEASTAR_ASSERT-s with static_assert-s
test: Convert slab test into boost kind
> Merge 'Coroutinize lister_test' from Pavel Emelyanov
test: Fix indentation after previuous patch
test: Coroutinize lister_test lister::report() method
test: Coroutinize lister_test main code
> file: recursive_remove_directory(): use a list instead of a deque
> Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
tls: Stop using net::packet in session::put()
tls: Fix indentation after previous patch
tls: Split session::do_put()
tls: Mark some session methods private
Closesscylladb/scylladb#27240
After in the previous patch we implemented support in Alternator for
gzip-compressed requests ("Content-Encoding: gzip"), here we enable
an existing xfail-ing test for this feature, and also add more tests
for more cases:
* A test for longer compressed requests, or a short compressed
request which expands to a longer request. Since the decompression
uses small buffers, this test reaches additional code paths.
* Check for various cases of a malformed gzip'ed request, and also
an attempt to use an unsupported Content-Encoding. DynamoDB
returns error 500 for both cases, so we want to test that we
do to - and not silently ignore such errors.
* Check that two concatenated gzip'ed streams is a valid request,
and check that garbage at the end of the gzip - or a missing
character at the end of the gzip - is recognized as an error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Fixes https://github.com/scylladb/scylladb/issues/27070
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.
We fix both places in this PR.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72
This PR is a bugfix that should be backported to all supported branches.
Closesscylladb/scylladb#27265
* github.com:scylladb/scylladb:
locator/node: include _excluded in verbose formatter
locator/node: preserve _excluded in clone()
Otherwise, in a mixed cluster, the handle_tablet_resize_finalization
would fail because of the unknown rpc verb.
Fixes#26309Closesscylladb/scylladb#27218
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.
Fixes https://github.com/scylladb/scylladb/issues/27141Closesscylladb/scylladb#27140
* https://github.com/scylladb/scylladb:
test: test that expired erm that held for too long triggers notification
token_metadata: fix notification about expiring erm held for to long
This patch enforces that vector indexes can only be created on keyspaces
that use tablets. During index validation, `check_uses_tablets()` verifies
the base keyspace configuration and rejects creation otherwise.
To support this, the `custom_index::validate()` API now receives a
`const data_dictionary::database&` parameter, allowing index
implementations to access keyspace-level settings during DDL validation.
Fixes https://scylladb.atlassian.net/browse/VECTOR-322Closesscylladb/scylladb#26786
When applying group0_command we now inspect
whether any auth internal tables were modified,
and reload affected role entries in the cache.
Since one auth DML may change multiple tables,
when iterating over mutations we deduplicate
affected roles across those tables.
Receiving snaphot is a rare event so as a simplification
we'll be reloading the whole cache instead of trying to merge
states, especially that expected size is small, below 100 records.
Reloading is non-disruptive operation, old entries are removed
only after all entries are loaded. If entry is updated, shared
pointer will be atomically replaced in a cache map.
Prepare for use in a subsequent commit in group0_state_machine,
where the auth cache will be integrated. This follows the same
pattern as updates to the service-level cache, view-building
state, and CDC streams.
It combines data from all underlying auth tables.
Supports gentle full load and per role reloads.
Loading is done on shard 0 and then deep copies data
to all shards.
A new feature was announced this week for Amazon DynamoDB, "multi-
attribute composite keys in global secondary indexes", which allows to
create GSIs with composite keys (multiple columns). This feature already
existed in CQL's materialized views, but didn't exist in DynamoDB until
now.
So this patch adds a paragraph to our docs/alternator/compatibility.md
mentioning that we don't support this DynamoDB feature yet.
See also issue #27182 which we opened to track this unimplemented
feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27183
Sometimes (though rarely) I call this script on mis-matching PR and
current branch. E.g. trying to merge master PR into stable next, or
2025.X PR into next-2025.Y (X != Y). Typically merge fails, but it's
good to catch it early.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27249
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.
With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.
In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.
Fixes https://github.com/scylladb/scylladb/issues/26311
This patch needs to be backported to 2025.4.
Closesscylladb/scylladb#26897
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: update the docs after recent changes
db/view/view_building: send coordinator's term in the RPC
db/view/view_building_state: replace task's state with `aborted` flag
db/view/view_building_coordinator: batch finished tasks reporting
db/view/view_building_worker: change internal implementation
db/view/view_building_coordinator: change `work_on_tasks` RPC return type
This PR adds support for limiting the maximum shares allocated to a
compaction scheduling class by the compaction controller. It introduces
a new configuration parameter, compaction_max_shares, which, when set
to a non zero value, will cap the shares allocated to compaction jobs.
This PR also exposes the shares computed by the compaction controller
via metrics, for observability purposes.
Fixes https://github.com/scylladb/scylladb/issues/9431
Enhancement. No need to backport.
NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696
Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50)
Closesscylladb/scylladb#27024
* github.com:scylladb/scylladb:
db/config: introduce new config parameter `compaction_max_shares`
compaction_manager:config: introduce max_shares
compaction_controller: add configurable maximum shares
compaction_controller: introduce `set_max_shares()`
The BlockIOWeight and CPUShares are deprecated. They are only used on
RHEL 7, which has reached end-of-life. Their replacements, IOWeight
and CPUWeight, are already set in the file.
Remove the deprecated settings to reduce noise in the logs.
Closesscylladb/scylladb#27222
When we delete a table in alternator, the schema change is performed on shard 0.
However, we actually use the storage_proxy from the shard that is handling the
delete_table command. This can lead to problems because some information is
stored only on shard 0 and using storage_proxy from another shard may make
us miss it.
In this patch we fix this by using the storage_proxy from shard 0 instead.
Fixes https://github.com/scylladb/scylladb/issues/27223Closesscylladb/scylladb#27224
There is a limited number of ways to obtain the schema of a table:
1) Use DESCRIBE TABLE in cqlsh
2) Find the schema definition in the code (for system tables)
3) Ask support/user to provide schema
4) Piece together the schema definition from the system tables
Option (1) is the most convenient but requires access to live cluster.
(2) is limited to system tables only.
When investigating issues for customers, we have to rely on (3) and this
often adds communication round-trips and delays. (4) requires knowledge
of ScyllaDB internals and access to system tables.
The new dump-schema commands provides a convenient way to obtain the
schema of tables, given that there is access to either an sstable or the
system tables. It can dump the schema of system tables without either.
Closesscylladb/scylladb#26433
In this patch we implement Alternator's support for gzip-compressed
requests, i.e., requests with the "Content-Encoding: gzip" header,
other uncompressed headers, and a gzip-compressed body.
The server needs to verify the signature of the *compressed* content,
and then uncompress the body before running the request.
We only support gzip compression because this is what DynamoDB supports.
But in the future we can easily add support for other compression
algorithms like lz4 or zstd.
This patch Refs #5041 but doesn't "Fixes" it because it only implements
compressed requests (Content-Encoding), *not* compressed responses
(Accept-Encoding).
The next patch will enable several tests for this feature and make sure
it behaves like DynamoDB.
Note that while we will have now support in our server for compressed
requests, just like DynamoDB does, the clients (AWS SDKs) will probably
NOT make use of it because they do not enable request compression by
default. For example, see the tests for some hoops one needs to jump
through in boto3 (the Python SDK) to send compressed requests. However,
we are hoping that in the future Alternator's modified clients will
use compressed requests and enjoy this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. such is tablet migration, and also tablet repair.
The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map. It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.
However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.
We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will suuport repairing colocated tablets.
This is a reasonable restriction because repair for these kind of tables
is not required or as important as for normal tables.
Fixes https://github.com/scylladb/scylladb/issues/27119
backport to 2025.4 since we must change it in the same version it's introduced before it's released
Closesscylladb/scylladb#27120
* github.com:scylladb/scylladb:
tombstone_gc: don't use 'repair' mode for colocated tables
Revert "storage service: add repair colocated tablets rpc"
topology_coordinator: don't repair colocated tablets
fix: windows gitbash support
fix: new name found with no group vector_search/vector_store_client.cc 343
fix: rm allowmismatch
fix: git bash (windows) compatibility
fix: git bash (windows) compatibility
Closesscylladb/scylladb#26173
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.
To avoid case when an old coordinator (which hasn't been stopped yet)
dictates what should be done, add raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.
Once a task was aborted, it cannot get resurrected to a normal state.
In previous implementation to execute view building tasks, the
coordinator needed to firstly set their states to `STARTED`
and then it needed to remove them before it could start the next ones.
This logic required a lot of group0 commits, especially in large
clusters with higher number of nodes and big tablet count.
After previous commit to the view building worker, the coordinator
can start view building tasks without setting the `STARTED` state
and deleting finished tasks.
This patch adjusts the coordinator to save finished tasks locally,
so it can continue to execute next ones and the finished tasks are
periodically removed from the group0 by `finished_task_gc_fiber()`.
To support debian13, we need to modify the installation instructions since `apt-key` command is no longer available
Also updated installation instruction to match the latest release
Fixes: https://github.com/scylladb/scylladb/issues/26673
**No need for backport since we added debian13 only in master for now**
Closesscylladb/scylladb#27205
* github.com:scylladb/scylladb:
install-on-linux.rst: update installation example to supported release
docs: modify debian/ubutnu installation instructions
The vector-search-validator is a binary tool which do functional and integration tests between scylla and vector-store. It is build in Rust mainly in vector-store repository. This patch adds possibility to write tests on scylladb repository side, compile them together with vector-store tests and run them in `test.py` environment.
There are three parts of the change:
- add sources of validator to the `test/vector_search_validator` directory
- add support for building validator and vector-store in `build/vector-search-validator/bin` directory with or without cmake
- add support for `pytest` and `test.py` to run validator test locally and in the CI environment; this part adds also README to the `test/vector_search_validator` directory
Design for validator integration tests: https://scylladb.atlassian.net/wiki/spaces/RND/pages/39518215/Vector+Search+Core+Test+Plan+Document
References: VECTOR-50
No backport needed as this is a new functionality.
Closesscylladb/scylladb#26653
* github.com:scylladb/scylladb:
vector_search: add vector-search-validator tests
vector_search: implement building vector-search-validator
vector_search: add vector-search-validator sources
For tables of special types that can be located: MV, CDC, and paxos
table, we should not use tombstone_gc=repair mode because colocated
tablets are never repaired, hence they will not have repair_time set and
will never be GC'd using 'repair' mode.
This reverts commit 11f045bb7c.
The rpc was added together with colocated tablets in 2025.4 to support a
"shared repair" operation of a group of colocated tablets that repairs
all of them and allows also for special behavior as opposed to repairing
a single specific tablet.
It is not used anymore because we decided to not repair all colocated
tablets in a single shared operation, but to repair only the base table,
and in a later release support repairing colocated tables individually.
We can remove the rpc in 2025.4 because it is introduced in the same
version.
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. such is tablet migration, and also tablet repair.
The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map. It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.
However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.
We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will suuport repairing colocated tablets.
This is a reasonable restriction because repair for these kind of tables
is not required or as important as for normal tables.
Fixesscylladb/scylladb#27119
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.
This change ensures tests using vector indexes run exclusively with tablets.
Fixes: VECTOR-49
Closesscylladb/scylladb#26843
The commit adds a functionality for `pytest` and `test.py` to run
`vector-search-validator` in `sudo unshare` environment. There are already two
tests - first parametrized `test_validator.py::test_validator[test-case-name]`
(run validator) and second `test_cargo_toml.py::test_cargo_toml` (check if the
current `Cargo.toml` for validator is correct).
Documentation for these tests are provided in `README.md`.
The commit adds targets building
`build/vector-search-validator/bin/{vector-store,vector-search-validator}. The
targets must be build for tests. They don't depend on build mode.
The commit adds target in `configure.py` and also in `cmake`.
The commit adds validator sources uses combination of local files and
vector-store's files. In `build-env` there are definition of vector-store git
repository and revision on which validator will be built. `cargo-toml-template`
is script for printing current `Cargo.toml` to the stdout. After updating
`build-env` developer needs to update new configuration with
`./cargo-toml-template > Cargo.toml`. Git revision is used in several places in
`Cargo.toml` and will be used for building `vector-store`, so for better
handling git revision it should be setup only in one place.
The validator is divided into several crates to be able to built it within
scylladb and vector-store repositories. Here we need to create a new validator
crate with simple `main` function and call `validator_engine::main` there. We
provide tests written in scylladb repo in `validator-scylla` crate. The commit
provides empty `cql` test case, which should be filled in the future.
Currently if a banned node tries to connect to a cluster it fails to
create connections, but has no idea why, so from inside the node it
looks like it has communication problems. This patch adds new rpc
NOTIFY_BANNED which is sent back to the node when its connection is
dropped. On receiving the rpc the node isolates itself and print an
informative message about why it did so.
Closesscylladb/scylladb#26943
Add support for the new configuration parameter `compaction_max_shares`,
and update the compaction manager to pass it down to the compaction
controller when it changes. The shares allocated to compaction jobs will
be limited by this new parameter.
Fixes#9431
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Introduce an updateable value `max_shares` to compaction manager's
config. Also add a method `update_max_shares()` that applies the latest
`max_shares` value to the compaction controller’s `max_shares`. This new
variable will be connected to a config parameter in the next patch.
Refs #9431
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a `max_shares` constructor parameter to compaction_controller to
allow configuring the maximum output of the control points at
construction time. The constructor now calls `set_max_shares()` with the
provided max_shares value. The subsequent commits will wire this value
to a new configuration option.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a method to dynamically adjust the maximum output of control points
in the compaction controller. This is required for supporting runtime
configuration of the maximum shares allocated to the compaction process
by the controller.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator.
This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty.
This bug was introduced in 10f07fb95a
This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size from by averaging the tablet sizes of the existing replicas.
This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster.
A version containing this bug has not yet been released, so a backport is not needed.
Closesscylladb/scylladb#27118
* github.com:scylladb/scylladb:
test: add tests for tablet size migration during end_migration
virtual_table: add tablet_sizes virtual table
load_stats: update tablet sizes after migration or rebuild
Example of installation is out of date, since scylla-5.2 is EOL for long time
upding the example for more recent release (together with packages update)
This commit fixes the information about object storage:
- Object storage configuration is no longer marked as experimental.
- Redundant information has been removed from the description.
- Information related to object storage for SStabels has been removed
as the feature is not working.
Fixes https://github.com/scylladb/scylladb/issues/26985Closesscylladb/scylladb#26987
This commit doesn't change the logic behind the view building worker but
it changes how the worker is executing view building tasks.
Previously, the worker had a state only on shard0 and it was reacting to
changes in group0 state. When it noticed some tasks were moved to
`STARTED` state, the worker was creating a batch for it on the shard0
state.
The RPC call was used only to start the batch and to get its result.
Now, the main logic of batch management was moved to the RPC call
handler.
The worker has a local state on each shard and the state
contains:
- unique ptr to the batch
- set of completed tasks
- information for which views the base table was flushed
So currently, each batch lives on a shard where it has its work to do
exclusively. This eliminates a need to do a synchronization between
shard0 and work shard, which was a painful point in previous
implementation.
The worker still reacts to changes in group0 view building state, but
currently it's only used to observe whether any view building tasks was
aborted by setting `ABORTED` state.
To prepare for further changes to drop the view building task state,
the worker ignores `IDLE` and `STARTED` states completely.
During the initial implementation of the view builing coordinator,
we decided that if a view building task fails locally on the worker
(example reason: view update's target replica is not available),
the worker will retry this work instead of reporting a failure to the
coordinator.
However, we left return type of the RPC, which was telling if a task was
finished successfully or aborted.
But the worker doesn't need to report that a task was aborted, because
it's the coordinator, who decides to abort a task.
So, this commit changes the return type to list of UUIDs of completed
tasks.
Previously length of the returned vector needed to be the same as length
of the vector sent in the request.
No we can drop this restriction and the RPC handler return list of UUIDs
of completed tasks (subset of vector sent in the request).
This change is required to drop `STARTED` state in next commits.
Since Scylla 2025.4 wasn't released yet and we're going to merge this
patch before releasing, no RPC versioning or cluster feature is needed.
rust/Cargo.toml locks the cxx packages to version 1.0.83,
but install-dependencies.sh does not lock cxxbridge-cmd, part
of that ecosystem. Since cxx 1.0.189 broke compatibility with
1.0.83 (understandable, as these are all sub-packages of a single
repository), builds with newer cxxbridge-cmd are broken.
Fix by locking cxxbridge-cmd to the same version as the other
cxx subpackages.
Regenerated frozen toolchain with optimized clang from
https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gzhttps://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
Probably better done by building cxxbridge-cmd during the build
itself, but that is a deeper change.
Fixes#27176Closesscylladb/scylladb#27177
The prepare scripts uses 'reg' to verify we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead. Skopeo
is part of the podman ecosystem so hopefully will live longer.
Fixes#27178.
Closesscylladb/scylladb#27179
This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.
These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the diges.
New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.
Backport is not required, since it is a new feature
Fixes#21776Closesscylladb/scylladb#26702
* github.com:scylladb/scylladb:
file_stream_test: add sstable file streaming integrity verification test cases
streaming: prioritize sender-side errors in tablet_stream_files
sstables: enable integrity check for data file streaming
sstables: Add compressed raw streaming support
sstables: Allow to read digest and checksum from user provided file instance
sstables: add overload of data_stream() to accept custom file_input_stream_options
In ALTER KEYSPACE, when a datacenter name is omitted, its replication
factor is implicitly set to zero with vnodes, while with tablets,
it remains unchanged.
ALTER KEYSPACE should behave the same way for tablets as it does
for vnodes. However, this can be dangerous as we may mistakenly
drop the whole datacenter.
Reject ALTER KEYSPACE if it changes replication factor, but omits
a datacenter that currently contains tablet replicas.
Fixes: https://github.com/scylladb/scylladb/issues/25549.
Closesscylladb/scylladb#25731
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes#26865Closesscylladb/scylladb#26941
* github.com:scylladb/scylladb:
address_map: Use barrier() to wait for replication
address_map: Use more efficient and reliable replication method
utils: Introduce helper for replicated data structures
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make them the new default.
If we change our mind, this change can be reverted later.
New functionality, and this is a drastic change. No backport needed.
Closesscylladb/scylladb#26377
* github.com:scylladb/scylladb:
db/config: enable `ms` sstable format by default
cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
api/system: add /system/chosen_sstable_version
test/cluster/dtest: reduce num_tokens to 16
vector_search: Add HTTPS support for vector store connections
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file
To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.
Fixes: VECTOR-327
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closesscylladb/scylladb#26935
* github.com:scylladb/scylladb:
test: vector_search: Ensure all clients are stopped on shutdown
vector_search: Add HTTPS support for vector store connections
A flaky test revealed that after `clients::stop()` was called,
the `old_clients` collection was sometimes not empty,
indicating that some clients were not being stopped correctly.
This resulted in sanitizer errors when objects went out of scope at the end of the test.
This patch modifies `stop()` to ensure all clients, including those in `old_clients`,
are stopped, guaranteeing a clean shutdown.
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file
To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.
Fixes: VECTOR-327
This change adds tests for the correctness of tablet size migration
during the end_migrations stage. This size migration can happend for
tablet migrations and for tablet rebuild.
When migrating tablet size during the end_migration tablet transition
stage, we need the pending and leaving replica hosts. The leaving and
pending replicas are gathered in objects of type
std::optional<tablet_replica> and are not checked if they contain a
value before dereferencing which could cause an exception in the
topology coordinator.
This patch adds a check for leaving and pending replicas, and only
perfoms the tablet size migration if neither are empty.
This bug was introduced in 10f07fb95a
This change also adds the functionality to add the tablet size to
load_stats after a tablet rebuild. We compute the average tablet size
from the existing replicas, and add the new size to the pending replica.
This PR:
- Removes n1-highmem instances from Recommended Instances.
- Adds missing support for n2-highmem-96.
- Updates the reference to n2 instances in the Google Cloud docs (fixes a broken link to GCP).
- Adds the missing information about processors for n2-highmem-instance - Ice Lake and Cascade Lake (requested by CX).
Fixes https://github.com/scylladb/scylladb/issues/25946
Fixes https://github.com/scylladb/scylladb/issues/24223
Fixes https://github.com/scylladb/scylladb/issues/23976
No backport needed if this PR is merged before 2025.4 branching.
Closesscylladb/scylladb#26182
* github.com:scylladb/scylladb:
doc: update information for n2-highmem instances
doc: remove n1-highmem instances from Recommended Instances
This commit updates the section for n2-highmem instances
on the Cloud Instance Recommendations page
- Added missing support for n2-highmem-96
- Update the reference to n2 instances in the Google Cloud docs.
- Added the missing information about processors for this instance
type (Ice Lake and Cascade Lake).
Add 'test_sstable_stream' to verify SSTable file streaming integrity check.
The new tests cover both compressed and uncompressed SSTables and includes:
- Checksum mismatch detection verification
- Digest mismatch detection verifivation
When 'send_data_to_peer' throws and
closes the sink, the peer later reports its own error, masking the
original sender failure.
This commit preserves the original sender exception.
If the status-retrieval task throws its own error before sender task rethrows its exception,
we can still propagate the original exception later.
This patch enables integrity check in 'create_stream_sources()' by introducing a new
'sstable_data_stream_source_impl' class for handling the Data component of
SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead
of the raw input_stream.
These additional checks require reading the digest and CRC components from
disk, which may introduce some I/O overhead. For uncompressed SSTables,
this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the
cost comes from reading, calculation and verifying the digest.
Implement compressed_raw_file_data_source that streams compressed chunks
without decompression while verifying checksums and calculating digests.
Extends raw_stream enum to support compressed_chunks mode.
This data_source implementation will be used in the next commits
for file based streaming.
Add overloaded methods to read digest and checksum from user-provided file
handles:
- 'read_digest(file f)'
- 'read_checksum(file f)
This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.
If we change our mind, this change can be reverted later.
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.
Use chosen_sstable_format instead.
Returns the sstable version currently chosen for use in for new sstables.
We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).
(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
Currently, the PRUNE MATERIALIZED VIEW statement performs all its
reads and writes in a single, continous sequence. This takes too
much time even for a moderate amount of 'PRUNED' data.
Instead, we want to make it possible to set a concurrency of the
reads and writes performed while processing the PRUNE statement,
so that if the user so desires, it may finish the PRUNING quicker
at the cost of adding more load on the cluster.
In this patch we add the CONCURRENCY setting to the USING clause
in cql. In the next patch, we'll be using it to actually set the
concurrency of PRUNE MATERIALIZED VIEW.
Add a test which verifies that if two nodes have the same data, with
different timestamps, repair will detect and fix the diverging
timestamps.
All our repair tests focus on difference in data and I remember writing
this test multiple times in the past to quickly verify whether this
works. Time to upstream this test.
Closesscylladb/scylladb#26900
A JSON assert happens when a JSON member is either missing or has
unexpected type. rapidjson has a very unhelpful "json assert failed"
message for this, with a backtrace (somewhat helpful), with no other
context. To help debug such errors, collect all request sent to the API
and dump them when such errors happen. The backtrace with the full
request history should be enough to debug any such issues.
Refs CUSTOMER-17
Closesscylladb/scylladb#26899
Remove the FIXME comment for re-enabling caching of the large tables
since the tables are used infrequently [1].
[1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364
Fixes#26032
Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com>
Closesscylladb/scylladb#26789
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime
Closesscylladb/scylladb#26617
In commit 51186b2 (PR #25457) we introduced new statistics for
authentication errors, and among other places we modified
executor::create_table() to update them when necessary.
This function runs its real work (create_table_on_shard0()) on shard
0, but incorrectly updates "_stats" from the original shard. It doesn't
really matter which shard's stats we update - but it does matter that
code running on shard 0 shouldn't touch some other shard's objects.
Since all we do on these stats is to increment an integer, the risk
of updating it on the wrong shard is minimal to non-existant, but it's
still wrong and can cause bigger trouble in the future as the code
continues to evolve.
The fix is simple - we should pass to create_table_on_shard0() the
_stats object from the acutal shard running it (shard 0).
Fixes#26942
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26944
DynamoDB's documentation
https://docs.aws.amazon.com/sdkref/latest/guide/feature-compression.html
suggests that DynamoDB allows request bodies to be compressed (currently
only by gzip).
The purpose of patch is to have a test reproducing this feature. The
test shows us that indeed DynamoDB understands compressed requests using
the "gzip" encoding, but Alternator does *not*, so the new test is xfail.
As you can see in the test code, although the low-level SDK (botocore)
can send compress requests, this is not actually enabled for DynamoDB
and we need to resort to some trickery to send compressed requests.
But the point is that once we do manage to send compressed requests,
the test shows us that they work properly on AWS, but fail on
Alternator.
The failure of the compressed requests on Alternator is reported like:
An error occurred (ValidationException) when calling the PutItem
operation: Parsing JSON failed: Invalid value. at 70459088
This error message should probably be improved (what is that high
number?!) but of course even better would be to make it really work.
By enabling tracing on alternator-server (e.g., edit test/cqlpy/run.py
and add `'--logger-log-level', 'alternator-server=trace',`) we can
see exactly what request the SDK sends Alternator. What we can see in
the request is:
1. The request headers are uncompressed (this is expected in HTTP)
2. There is a header "Content-Encoding: gzip"
3. The request's body is binary, a full-fleged gzip output complete with
a gzip magic in the beginning.
Refs #5041
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27049
Add documentation for the quarantine/ subdirectory that holds SSTables
isolated due to validation failures or corruption. Document the scrub
operation's quarantine_mode parameter options and the drop_quarantined_sstables
API operation.
Also update the directory hierarchy example to include the quarantine directory.
Fixes#10742
Signed-off-by: Shreyas Ganesh <vansi.ganeshs@gmail.com>
Closesscylladb/scylladb#27023
On clang 21.1.4 (Fedora 43) the abseil compilation started to fail with `builtin XXX is deprecated use YYY instead`. Suppress this for abseil compilation only
Closesscylladb/scylladb#27098
Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well.
Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls.
Shuffling components dependencies, no need to backport
Closesscylladb/scylladb#27021
* github.com:scylladb/scylladb:
sstables_manager: Drop db::config from sstables_manager
tools/sstable: Make shard_of_with_tablets use db::config argument
tools/sstable: Add db::config& to all operations
tools/sstable: Get endpoints from storage manager
sstables_manager: Hold sstable IO extensions on it
sstables: Manager helper to grab file io extensions
sstables_manager: Move default format on config
sstables_manager: Move enable_sstable_data_integrity_check on config
sstables_manager: Move data_file_directories on config
sstables_manager: Move components_memory_reclaim_threshold on config
sstables_manager: Move column_index_auto_scale_threshold on config
sstables_manager: Move column_index_size on config
sstables_manager: Move sstable_summary_ratio on config
sstables_manager: Move enable_sstable_key_validation on config
sstables_manager: Move available_memory on config
code: Introduce sstables_manager::config
sstables: Patch get_local_directories() to work on vector of paths
code: Rename sstables_manager::config() into db_config()
When we enable CDC on a table, Scylla creates a log table for it.
It has default properties, but the user may change them later on.
Furthermore, it's possible to detach that log table by simply
disabling CDC on the base table:
```cql
/* Create a table with CDC enabled. The log table is created. */
CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true};
/* Detach the log table. */
ALTER TABLE ks.t WITH cdc = {'enabled': false};
/* Modify a property of the log table. */
ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13;
```
The log table can also be reattached by enabling CDC on the base table
again:
```cql
/* Reattach the log table */
ALTER TABLE ks.t WITH cdc = {'enabled': true};
```
However, because the process of reattachment goes through the same
code that created it in the first place, the properties of the log
table are rolled back to their default values. This may be confusing
to the user and, if unnoticed, also have other consequences, e.g.
affecting performance.
To prevent that, we ensure that the properties are preserved.
A reproducer test,
`test_log_table_preserves_properties_after_reattachment`, has been
provided to verify that the changes are correct.
Another test, `test_log_table_preserves_id_after_reattachment`, has
also been added because the current implementation sets properties
and the ID separately.
Fixesscylladb/scylladb#25523
Backport: not necessary. Although the behavior may be unexpected,
it's not a bug per se.
Closesscylladb/scylladb#26443
* github.com:scylladb/scylladb:
cdc: Preserve properties when reattaching log table
cdc: Extract creating columns in CDC log table to dedicated function
cdc: Extract default properties of CDC log tables to dedicated function
schema/schema_builder.hh: Add set_properties
schema: Add getter for schema::user_properties
schema: Remove underscores in fields of schema::user_properties
schema: Extract user properties out of raw_schema
Fixes#26874
Due to certain people (me) not being able to tell forward from backward,
the data alignment to ensure partial uploads adhere to the 256k-align
rule would potentially _reorder_ trailing buffers generated iff the
source buffers input into the sink are small enough. Which, as a fun fact,
they are in backup upload.
Change the unit test to use raw sink IO and add two unit tests (of which
the smaller size provokes the bug) that checks the same 64k buf segmented
upload backup uses.
Closesscylladb/scylladb#26938
Add a table name to Alternator's tracing output, as some clients would
like to consistently receive this information.
- add missing `tracing::add_table_name` in `executor::scan`
- add emiting tables' names in `trace_state::build_parameters_map`
- update tests, so when tracing is looked for it is filtered by table's
name, which confirms table is being outputed.
- change `struct one_session_records` declaration to `class one_session_records`,
as `one_session_records` is later defined as class.
Refs #26618Fixes#24031Closesscylladb/scylladb#26634
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.
It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.
To avoid this problem, let's just decrease the number of vnodes
for in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based have been setting `num_tokens` to 16 for a long time too).
This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
There are two issues with it.
First, a small RPC helper struct carries database reference on board just to get feature state from it.
Second, sstable_streamer uses database as proxy to feature service.
This PR improves both.
Services dependencies improvement, not need to backport
Closesscylladb/scylladb#26989
* github.com:scylladb/scylladb:
sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging
sstables_loader: Keep bool on send_meta, not database reference
Currently we do not allow tablet merge if either of the tablets contain
a tablet repair request. This could block the tablet merge for a very
long time if the repair requests could not be scheduled and executed.
We can actually merge the repair tasks in most of the cases. This is
because most of the time all tablets are requested to be repaired by a
single API request, so they share the same task_id, request_type and
other parameters. We can merge the repair task info and executes the
repair after the merge. If they do not share the task info, we could
not merge and have to wait for the repair before merge, which is both
rare and ok.
Another case is that one of the tablet has a repair task info (t1) while
the other tablet (t2) does not have, it is possible the t2 has finished
repair by the same repair request or t2 is not requested to be repaired
at all. We allow merge in this case too to avoid blocking the tablet
merge, with the price of reparing a bit more.
Fixes#26844Closesscylladb/scylladb#26922
Adds a new documentation page for the incremental repair feature.
The page covers:
- What incremental repair is and its benefits over the standard repair process.
- How it works at a high level by tracking the repair status of SSTables.
- The prerequisite of using the tablets architecture.
- The different user-configurable modes: 'regular', 'full', and 'disabled'.
Fixes#25600Closesscylladb/scylladb#26221
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes
It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.
Fixes#26229.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#27078
Fixes#27077
Multiple points can be clarified relating to:
* Names of each sub-procedure could be clearer
* Requirements of each sub-procedure could be clearer
* Clarify which keyspaces are relevant and how to check them
* Fix typos in keyspace name
Closesscylladb/scylladb#26855
This change makes the code simpler and less vulnerable to regressions.
There is no functional impact because:
- we already include a decommissioning/bootstrapping/replacing node for
`barrier` and `barrier_and_drain`,
- we never execute global commands in the presence of a rebuilding node,
- removing node always belongs to `exclude_nodes`, so it's filtered out
anyway,
- we execute global `stream_ranges` only for removenode,
- we execute global `wait_for_ip` only for new nodes when there are no
transitioning nodes.
Fixes#20272Fixes#27066Closesscylladb/scylladb#27102
According to the IETF spec uuid variant bits should be set to '10'. All
others are either invalid or reserved. The patch change the code to
follow the spec.
Closesscylladb/scylladb#27073
Update nodetool scrub documentation to include --quarantine-mode and --drop-unfixable-sstables options,
add a section explaining quarantine modes
and provide examples and procedures for handling and removing corrupted SSTables.
Closesscylladb/scylladb#27018
One of the tests in test_describe.py used "USE {test_keyspace}" which
affects the CQL session shared by all tests in an unrecoverable way (there
is no "UNUSE" statement).
As an example of what might happen if the shared CQL session is "polluted"
by a USE, issue #26334 is about a bug we have in DESC KEYSPACES when USE
is active.
So in this patch, we:
1. Fix the test to not use USE on the shared CQL session - it's easy to
create a separate session to use the "USE" on. With this fix, the test
no longer leaves the shared CQL session in a "USE" state.
2. Add a new xfailing test to reproduce the DESC KEYSPACES bug.
Refs #26334
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26345
The purpose of this two-patch series is to reproduce a previously unknown bug, Refs #26914.
Recently we saw a lot of patches that change how we create new schemas (keyspaces and tables), sometimes changing various long-time defaults. We started to worry that perhaps some of these defaults were applied only to CQL base tables and perhaps not to Alternator or to CQL's auxiliary tables (materialized views, secondary indexes, or CDC logs). For example, in Refs #26307 we wondered if perhaps the default "speculative_retry" option is different in Alternator than in CQL.
The first patch includes Alternator tests, and the second CQL tests. In both tests we discover that although recently (commit adf9c42, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, actually this change was **only** applied to CQL base tables. All Alternator tables and all CQL auxiliary tables (views, indexes, CDC) still use the old LZ4Compressor. This is issue #26914.
Closesscylladb/scylladb#26915
* github.com:scylladb/scylladb:
test/cqlpy: test compression setting for auxiliary table
test/alternator: tests for schema of Alternator table
This change adds support for secondary vector store clients, typically
located in different availability zones. Secondary clients serve as
fallback targets when all primary clients are unavailable.
New configuration option allows specifying secondary client addresses
and ports.
Fixes: VECTOR-187
Closesscylladb/scylladb#26484
When streaming SSTables by tablet range, the original implementation of tablet_sstable_streamer::stream may break out of the loop too early when encountering a non-overlaping SSTable. As a result, subsequent SSTables that should be classified as partially contained are skipped entirely.
Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]
The loop uses if (!overlaps) break; semantics, which conflated “no overlap” with “done scanning.” This caused premature termination when an SSTable did not overlapped but the following one did.
Correct logic should be:
before(sst_last) → skip and continue.
after(sst_first) → break (no further SSTables can overlap).
Otherwise → `contains` to classify as full or partial.
Missing SSTables in streaming and potential data loss or incomplete streaming in repair/streaming operations.
1. Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved.
2. Refactor the loop to use before() and after() checks explicitly, and only break when the SSTable is entirely after the tablet range
3. Add pytest to cover this case, full streaming flow by means of `restore`
4. Add boost tests to test the new refactored function
This data corruption fix should be ported back to 2024.2, 2025.1, 2025.2, 2025.3 and 2025.4
Fixes: https://github.com/scylladb/scylladb/issues/26979Closesscylladb/scylladb#26980
* github.com:scylladb/scylladb:
streaming: fix loop break condition in tablet_sstable_streamer::stream
streaming: add pytest case to reproduce mutation loss issue
vector_search: Fix error handling and status parsing
This change addresses two issues in the vector search client that caused
validator test failures: incorrect handling of 5xx server errors and
faulty status response parsing.
1. 5xx Error Handling:
Previously, a 5xx response (e.g., 503 Service Unavailable) from the
underlying vector store for an `/ann` search request was incorrectly
interpreted as a node failure. This would cause the node to be marked
as down, even for transient issues like an index scan being in progress.
This change ensures that 5xx errors are treated as transient search
failures, not node failures, preventing nodes from being incorrectly
marked as down.
2. Status Response Parsing:
The logic for parsing status responses from the vector store was
flawed. This has been corrected to ensure proper parsing.
Fixes: SCYLLADB-50
Backport to 2025.4 as this problem is present on this branch.
Closesscylladb/scylladb#27111
* github.com:scylladb/scylladb:
vector_search: Don't mark nodes as down on 5xx server errors
test: vector_search: Move unavailable_server to dedicated file
vector_search: Fix status response parsing
This patch adds a metric for pre-compression size of sstable files.
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.
New functionality, no backport needed.
Closesscylladb/scylladb#26996
* github.com:scylladb/scylladb:
replica/table: add a metric for hypothetical total file size without compression
replica/table: keep track of total pre-compression file size
For an `/ann` search request, a 5xx server response does not
indicate that the node is down. It can signify a transient state, such
as the index full scan being in progress.
Previously, treating a 503 error as a node fault would cause the node
to be incorrectly marked as down, for example, when a new index was
being created. This commit ensures that such errors are treated as
transient search failures, not node failures.
* seastar 63900e03...340e14a7 (19):
> Merge 'rpc: harden sink_impl::close()' from Benny Halevy
rpc: sink_impl::close: fixup indentation
rpc: harden sink_impl::close()
> http: Document the way "unread body bytes" accounting works
> net: tighten port load balancing port access
> coroutine: reimplement generator with buffered variant
> Merge 'Stop using net::packet in posix data sink' from Pavel Emelyanov
net/posix-stack: Don't use packet in posix_data_sink_impl
reactor: Move fragment-vs-iovec static assertion
reactor: Make backend::sendmsg() calls use std::span<iovec>
utils: Introduce iovec_trim_front helper
utils: Spannize iovec_len()
> Merge 'Generalize memory data sink in tests' from Pavel Emelyanov
test: Make output_stream_test splitting test case use own sink
test: Make some output_stream_test cases use memory data sink
test: Threadify one of output_stream_test test cases
test: Make json_formatter_test use memory_data_sink
test: Move memory_data_sink to its own header
> dns: avoid using deprecated c-ares API
> reactor: Move read_directory() to posix_file_impl
> Merge 'rpc: sink_impl: batch sending and deletion of snd_buf:s' from Benny Halevy
test: rpc_test: add test_rpc_stream_backpressure_across_shards
reactor: add abort_on_too_long_task_queue option
rpc: make sink flush and close noexcept
rpc: sink_impl: batch sending and deletion of snd_buf:s
rpc: move sink_impl and source_impl into internal namespace
rpc: sink_impl: extend backpressure until snd_buf destroy
> configure.py: fix --api-level help
> Merge 'Close http client connection if handler doesn't consume all chunked-encoded body' from Pavel Emelyanov
test: Fix indentation after previous patch
test/http: Extend test for improper client handling of aborted requests
test/http: Ignore EPIPE exception from server closing connection
test/http: Split the partial response body read test
http: Track "remaining bytes" for chunked_source_impl
http: Switch content_length_source_impl to update remaining bytes
> metrics: Add default ~config()
> headers: Remove smp.hh from app-template.hh
> prometheus: remove hostname and metric_help config
> rpc: Tune up connection methods visibility
> perf_tests: Fix build with fmt 12.0.0 by avoiding internal functions
> doc: Fix some typos in codying style
> reactor: Remove unused try_sleep() method
directory_lister::get is adjusted in this patch to
use the new experimental::coroutine::generator interface
that was changed in scylladb/seastar@81f2dc9dd9
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#26913
In pull request #26960 there was some discussion about what is the valid form of ExclusiveStartKey, and whether we need to allow some "non-standard" uses of it for scan over system tables (which aren't real Alternator tables and may have multiple key columns, something not possible in normal Altenrator tables). This made me realize our tests for what is allowed - and what is not allowed - in ExclusiveStartKey - are very sparse and don't cover all the cases that are possible in Scan and Query, in base tables and in GSIs.
So this small series attempts to increase the coverage of the tests for ExclusiveStartKey to make sure we are compatible with DynamoDB and also that we don't regress in #26960.
The new tests reproduce a previously unknown error-path issues, #26988, where in some cases DynamoDB considers ExclusiveStartKey to be invalid but Alternator erronously accepts. Fortunately, we didn't find any success-path (correctness) bugs.
Closesscylladb/scylladb#26994
* github.com:scylladb/scylladb:
test/alternator: tests for ExclusiveStartKey in GSI
test/alternator: more tests for ExclusiveStartKey in Scan
test/alternator: more tests for ExclusiveStartKey in Query
Having variadic dispose_gently() clear inputs concurrently
serves no purpose, since this is a CPU bound operation. It
will just add more tasks for the reactor to process.
Reduce disruption to other work by processing inputs
sequentially.
Closesscylladb/scylladb#26993
Increase the value of the max-networking-io-control-blocks option
for the cpp tests as it is too low and causes flakiness
as seen in vector_search.vector_store_client_test.vector_store_client_single_status_check_after_concurrent_failures:
```
seastar/src/core/reactor_backend.cc:342: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end` failed.
```
See also https://github.com/scylladb/seastar/issues/976Fixes#27056
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27117
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.
More efficient than 100 pings.
There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes#26865Fixes#26835
Key goals:
- efficient (batching updates)
- reliable (no lost updates)
Will be used in data structures maintained on one designed owning
shard and replicated to other shards.
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many dataset
is a guaranteed OOM.
Fix that, and add some test that catches this.
Fixesscylladb/scylladb#27082Closesscylladb/scylladb#27083
The response was incorrectly parsed as a plain string and compared
directly with C++ string. However, the body contains a JSON string,
which includes escaped quotes that caused comparison failures.
In the previous patch we noticed that although recently (commit adf9c42,
Refs #26610) we changed the default sstable compressor from LZ4Compressor
to LZ4WithDictsCompressor, this change was only applied to CQL, not to
Alternator.
In this patch we add tests that demonstrate that it's even worse - the
new compression only applies to CQL's *base* table - all the "auxiliary"
tables -
* Materialized views
* Secondary index's materialized views
* CDC log tables
all still have the old LZ4Compressor, different from the base table's
default compressor.
The new test fails on Scylla, reproducing #26914, and passes on
Cassandra (on Cassandra, we only compare the materialized view table,
because SI and CDC is implemented differently).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch introduces a new test that exposed a previously unknown bug,
Refs #26914:
Recently we saw a lot of patches that change how we create new schemas
(keyspaces and tables), sometimes changing various long-time defaults.
We started to worry that perhaps some of these defaults were applied only
to CQL and not to Alternator. For example, in Refs #26307 we wondered if
perhaps the default "speculative_retry" option is different in Alternator
than in CQL.
This patch includes a new test file test/alternator/test_cql_schema.py,
with tests for verifying how Alternator configures the underlying tables
it creates. This test shows that the "speculative_retry" doesn't have
this suspected bug - it defaults to "99.0PERCENTILE" in both CQL and
Alternator. But unfortunately, we do have this bug with the "compression"
option:
It turns out that recently (commit adf9c42, Refs #26610) we changed the
default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor,
but the change was only applied to CQL, not Alternator. So the test that
"compression" is the same in both fails - and marked "xfails" and
I created a new issue to track it - #26914.
Another test verifies that Alternators "auxiliary" tables - holding
GSIs, LSIs and Streams - have the same default properties as the base
table. This currently seems to hold (there is no bug).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The feature in question is about the way streaming sink-and-source
operate. Since sink-and-source itself are obtained from messaging
service, the feature is better be fetched from it too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The send_meta helper needs database reference to get feature_service
from it (to check some feature state). That's too much, features are
"immutable" throug the loader lifetime, it's enough to keep the boolean
member on send_meta.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes#27055Closesscylladb/scylladb#27075
This patch adds the missing warning about the lack of possibility
to return the similarity distance. This will be added in the next
iteration.
Fixes#27086
It has to be backported to 2025.4 as this is the limitation in 2025.4.
Closesscylladb/scylladb#27096
Introduce a test that demonstrates mutation loss caused by premature
loop termination in tablet_sstable_streamer::stream. The code broke
out of the SSTable iteration when encountering a non-overlapping range,
which skipped subsequent SSTables that should have been partially
contained. This test showcases the problem only.
Example:
Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]
Following 9b6ce030d0 ("sstables: remove quadratic (and possibly
exponential) compile time in parse()"), where we removed recursion
in reading, we do the same here for variadic write. This results
in a small reduction in compile time.
Note the problem isn't very bad here. This is tail-recursion, so likely
removed by the compiler during optimization, and we don't have additional
amplification due to future::then() double-compiling the ready-future
and unready-future paths. Still, better to avoid quadratic compile
times.
Closesscylladb/scylladb#27050
This reverts commit 43738298be.
This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.
Closesscylladb/scylladb#27067Fixes#27047
The updates include:
- adding missing parts like topology states and table rows,
- documenting zero-token nodes,
- replacing the old recovery procedure with the new one.
Fixes#26412
Updates of internal docs (usually read on master) don't require
backporting.
Closesscylladb/scylladb#27022
* github.com:scylladb/scylladb:
docs/dev/topology-over-raft: update the recovery section
docs/dev/topology-over-raft: document zero-token nodes
docs/dev/topology-over-raft: clarify the lack of tablet-specific states
docs/dev/topology-over-raft: add the missing join_group0 state
docs/dev/topology-over-raft: update the topology columns
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported version since automatic cleanup behaviour as it is now may create unexpected by the operator load during cluster resizing.
Closesscylladb/scylladb#26868
* https://github.com/scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
The executor::add_stream_options() obtains local database reference from
proxy just to get feature service from it.
Similar chain is used in executor::update_time_to_live().
It's shorter to get features from proxy itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26973
…played
Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection.
Update _last_replay only if all batches were replayed.
Fixes: https://github.com/scylladb/scylladb/issues/24415.
Needs backport to all live versions.
Closesscylladb/scylladb#26793
* github.com:scylladb/scylladb:
test: extend test_batchlog_replay_failure_during_repair
db: batchlog_manager: update _last_replay only if all batches were replayed
After in the previous patches we added more exhaustive testing for
the ExclusiveStartKey feature of Query and Scan, in this patch we
add tests for this feature in the context of GSIs.
Most interestingly, the ExclusiveStartKey when querying a GSI
isn't just the key of the GSI, but also includes the key columns
of the base - in other words, it is the key that Scylla uses for
its materialized view.
The tests here confirm that paging on GSI works - this paging uses
ExclusiveStartKey of course - but also what is the specific structure
and meaning of the content of ExclusiveStartKey.
We also include two xfailing tests which again, like in the previous
patches, show we don't do enough validation (issue #26988) and
don't recognize wrong values or spurious columns in ExclusiveStartKey.
As usual, all new tests pass on DynamoDB, and all except the xfailing
ones pass on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch we added more tests for ExclusiveStartKey
in the context of the "Query" request. Here we do a similar thing
for "Scan". There are fewer error cases for Scan. In particular,
while it didn't make sense to use ExclusiveStartKey on a Query on a
table without a sort key (since a Query there always returns a single
item), for Scan it's needed - for paging. So we add in this patch
a test (that we didn't have before!) that Scan paging works correctly
also in the case of a table without a sort key.
This patch has one xfailing test reproducing #26988, that we don't
recognize and refuse spurious columns (columns not in the key) in
ExclusiveStartKey.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have in test/alternator/test_query.py a test -
test_query_exclusivestartkey - for one successful uses of
ExclusiveStartKey. But we missed testing quite a few edge cases of
this parameter, so this patch adds more tests for it - see the
comments on each individual test explaining its purpose.
With the new tests, we actually identified three cases where we got
the error handling wrong - cases of ExclusiveStartKey which DynamoDB
refuses, but Alternator allows. So three of the tests included here
pass on DynamoDB but fail on Alternator, so are marked with "xfail".
Refs #26988 - which is a new issue about these three cases of missing
validation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Previously, `wait_until_driver_service_level_created` only waited for
the `driver` service level to appear in the output of
`LIST ALL SERVICE_LEVELS`. However, the fact that one node lists
`sl:driver` does not necessarily mean that all other nodes can see
it yet. This caused sporadic test failures, especially in DEBUG builds.
To prevent these failures, this change adds an extra wait for
a `raft/read_barrier` after the `driver` service level first appears.
This ensures the service level is globally visible across the cluster.
Fixes: scylladb/scylladb#27019
Pass a ManagerClient instead of a `cql` session to
`wait_until_driver_service_level_created`. This makes it easier
to add additional functionality to the helper later (e.g. waiting for
a Raft read barrier in a subsequent commit).
Refs: scylladb/scylladb#27019
This patch adds support for multiple audit log outputs.
If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.
Performance testing shows that read query throughput and auth request throughput are consistent even at high reactor utilization. It can also be observed that read query latency increases a bit.
Read query ops = 60k/s
AUTH ops = 200/s
| Audit Mode | QUERY latency (p99) | Δ% vs none |
|------------|---------------------|------------|
| none | 777 | 0 |
|table| 801 | +3.09% |
|syslog | 803 | +3.35% |
|table,syslog | 818 | +5.28% |
Read query ops = 50k/s
AUTH ops = 200/s
| Audit Mode | QUERY latency (p99) | Δ% vs none |
|------------|---------------------|------------|
| none | 643 | 0 |
|table| 647 | +0.62% |
|syslog | 648 | +0.78% |
|table,syslog | 656 | +2.02% |
Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test)
Fixes#26022
Backport:
The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it.
Closesscylladb/scylladb#26613
* github.com:scylladb/scylladb:
test: dtest: audit_test.py: add AuditBackendComposite
test: dtest: audit_test.py: group logs in dict per audit mode
audit: write out to both table and syslog
audit: move storage helper creation from `audit::start` to `audit::audit`
audit: fix formatting in `audit::start_audit`
audit: unify `create_audit` and `start_audit`
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
A Vector Store node is now considered down if it returns an HTTP 500
server error. This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is explicitly 'SERVING'.
Fixes: VECTOR-187
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closesscylladb/scylladb#26413
* github.com:scylladb/scylladb:
vector_search: Improve vector-store health checking
vector_search: Move response_content_to_sstring to utils.hh
vector_search: Add unit tests for client error handling
vector_search: Enable mocking of status requests
vector_search: Extract abort_source_timeout and repeat_until
vector_search: Move vs_mock_server to dedicated files
When we enable CDC on a table, Scylla creates a log table for it.
It has default properties, but the user may change them later on.
Furthermore, it's possible to detach that log table by simply
disabling CDC on the base table:
```cql
/* Create a table with CDC enabled. The log table is created. */
CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true};
/* Detach the log table. */
ALTER TABLE ks.t WITH cdc = {'enabled': false};
/* Modify a property of the log table. */
ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13;
```
The log table can also be reattached by enabling CDC on the base table
again:
```cql
/* Reattach the log table */
ALTER TABLE ks.t WITH cdc = {'enabled': true};
```
However, because the process of reattachment goes through the same
code that created it in the first place, the properties of the log
table are rolled back to their default values. This may be confusing
to the user and, if unnoticed, also have other consequences, e.g.
affecting performance.
To prevent that, we ensure that the properties are preserved.
A reproducer test,
`test_log_table_preserves_properties_after_reattachment`, has been
provided to verify that the changes are correct. It fails before this
commit.
Another test, `test_log_table_preserves_id_after_reattachment`, has
also been added because the current implementation sets properties
and the ID separately.
Fixesscylladb/scylladb#25523
We extract the portion of the code responsible for creating columns
in a CDC log table to a separate, dedicated function. This should
improve the overall readability of the function (and also making it
very short now).
We extract the portion of the code responsible for setting the default
properties of a CDC log table to a separate function. This should
improve the overall readability of the function. Also, it should be
helpful when modifying the code later on in this commit series.
The properties can be directly manipulated by the user
via statements like `ALTER TABLE`. To better organize
the structure of `raw_schema`, we encapsulate that data
in the form of a dedicated struct. This change will be
later used for applying multiple properties to `schema_builder`
in one go.
Some of the columns were added, but the doc wasn't updated.
`upgrade_state` was updated in only one of the two places.
`ignore_nodes` was changed to a static column.
This PR fixes staging stables handling by view building coordinator in case of intra-node tablet migration or tablet merge.
To support tablet merge, the worker stores the sstables grouped only be `table_id`, instead of `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting relevant for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to cleanup them on source shard.
The patch should be backported to 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26244Closesscylladb/scylladb#26454
* github.com:scylladb/scylladb:
service/storage_service: migrate staging sstables in view building worker during intra-node migration
db/view/view_building_worker: support sstables intra-node migration
db/view_building_worker: fix indent
db/view/view_building_worker: don't organize staging sstables by last token
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
Move the response_content_to_sstring utility function from
vector_store_client.cc to utils.hh to enable reuse across
multiple files.
This refactoring prepares for the upcoming `client.cc` implementation
that will also need this functionality.
Introduce dedicated unit tests for the client class to verify existing
functionality and serve as regression tests.
These tests ensure that invalid client requests do not cause nodes to
be marked as down.
Extend the mock server to allow inspecting incoming status requests and
configuring their responses.
This enables client unit tests to simulate various server behaviors,
such as handling node failures and backoff logic.
The `abort_source_timeout` and `repeat_until` functions are moved to
the shared utility header `test/vector_search/utils.hh`.
This allows them to be reused by upcoming `client` unit tests, avoiding
code duplication.
It's not extremely elegant, but one tool operation needs db::config --
the "shard of" one. Currently it gets one from sstables_manager, but
manager is going to stop using db::config, and the operation needs to
get it elsehow.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tool may open sstables on S3. For that it gets configured endpoints
with the help of db::config obtained from sstables_manager.db_config().
However, storage endpoints are maintained by sstables storage manager,
and since tool has this instance, it's better to use storage manager to
get list of endpoints.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently manager holds a reference on db::config and when sstables IO
extensions are needed it grabs them from this config. Since db::config
is going to be removed from sstables manager, it should either keep
track of all config extensions, or only those that it needs. This patch
makes the latter choice and keeps reference to sstable_file_io_ext. on
manager. The reference is passed as constructor argument, not via
manager config, but it's a random choice, no specific reason why not
putting it on config itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently all the code that needs to iterate over sstables extensions
get config from manager, extensions from it and then iterate. Add a
helper that returns extensions directly. No real changes, just a helper.
Next patch will change the way the helper works.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's explicitly `me` type by default, but places that can write sstables
override it with db::config value: replica::database, tests and scylla
sstable tool.
Live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only the
replica::database and tests may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.
This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.
To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.
Backport is not required, it is an improvement
Refs #21776Closesscylladb/scylladb#26444
* github.com:scylladb/scylladb:
boost/repair_test: add repair reader integrity verification test cases
test/lib: allow to disable compression in random_mutation_generator
sstables: Skip checksum and digest reads for unlinked SSTables
table: enable integrity checks for streaming reader
table: Add integrity option to table::make_sstable_reader()
sstables: Add integrity option to create_single_key_sstable_reader
This patch fixes 2 issues at one go:
First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
Fixes#26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#26991
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixesscylladb/scylladb#26734
Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hitting more often, as _ss was called earlier, but it's not released yet.
Closesscylladb/scylladb#26779
* github.com:scylladb/scylladb:
service: attach storage_service to migration_manager using pluggabe
service: migration_manager: corutinize merge_schema_from
service: migration_manager: corutinize reload_schema
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixes https://github.com/scylladb/scylladb/issues/26976
backport to previous versions since it fixes a bug in MV with vnodes
Closesscylladb/scylladb#27008
* github.com:scylladb/scylladb:
test: add mv write during node join test
topology_coordinator: include joining node in barrier
Set its default value to the one from db/config.cc. Only the
replica::database may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it OFF by default and update only those callers, that may have it
ON -- the replica::database, tests and scylla-sstable tool.
Also not live-updateable, so plain bool.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, this parameter is passed to sstables_manager as explicit
constructor argument.
Also, it's not live-updateable, so a plain size_t type for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is specific configuration for sstables_manager. All places that
construct sstables manager are updated to provide config to it. For now
the config is empty and exists alongside with db::config. Further
patches will populate the former config with data and the latter config
will be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it uses db::config. Next patches will eliminate db::config from this
code and the helper in question will need to get datadir names
explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The config() method name is going to return sstables_manager config, so
first need to set this name free.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.
The PR improves codes readability and efficiency, no behavior changes.
backport: this is a refactoring/optimization, no need to backport
Closesscylladb/scylladb#26907
* https://github.com/scylladb/scylladb:
topology_coordinator: drop unused exec_global_command overload
topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
topology_state_machine: inline get_excluded_nodes
messaging_service: simplify and optimize ban_host
storage_service: topology_state_load: extract topology variable
topology_coordinator: excluded_tablet_nodes -> ignored_nodes
topology_state_machine: add excluded_tablet_nodes field
vector_search: Add backoff for failed nodes
Introduces logic to mark nodes that fail to answer an ANN request as
"down". Down nodes are omitted from further requests until they
successfully respond to a health check.
Health checks for down nodes are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
Client list management is moved to separate files (clients.cc/clients.hh)
to improve code organization and modularity.
References: VECTOR-187.
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closesscylladb/scylladb#26308
* github.com:scylladb/scylladb:
vector_search: Set max backoff delay to 2x read request timeout
vector_search: Report status check exception via on_internal_error_noexcept
vector_search: Extract client management into dedicated class
vector_search: Add backoff for failed clients
vector_search: Make endpoint available
vector_search: Use std::expected for low-level client errors
vector_search: Extract client class
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixesscylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
Closesscylladb/scylladb#26856
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: #24664Closesscylladb/scylladb#26330
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test.
Fixesscylladb/scylladb#26686
Backport: impact on the user. We should backport it to 2025.4.
Closesscylladb/scylladb#26729
* github.com:scylladb/scylladb:
tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
db/view/view_building_coordinator: Rate limit logging failed RPC
db/view: Add backoff when RPC fails
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixes https://github.com/scylladb/scylladb/issues/26340
the issue affects all previous releases, backport to improve stability
Closesscylladb/scylladb#26533
* github.com:scylladb/scylladb:
test: test concurrent writes with column drop with cdc preimage
cdc: check if recreating a column too soon
cdc: set column drop timestamp in the future
migration_manager: pass timestamp to pre_create
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixesscylladb/scylladb#26734
Since error is not printed to stdout, when working with multiple
files, we don't know whith which sstable the error is associated with.
Closesscylladb/scylladb#27009
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
This exception should only occur due to internal errors, not client or external issues.
If triggered, it indicates an internal problem. Therefore, we notify about this exception
using on_internal_error_noexcept.
Introduces logic to mark clients that fail to answer an ANN request as
"down". Down clients are omitted from further requests until they
successfully respond to a health check.
Health checks for down clients are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
In preparation for a new feature, the tests need the ability to make
an endpoint that was previously unavailable, available again.
This is achieved by adding an `unavailable_server::take_socket` method.
This method allows transferring the listening socket from the
`unavailable_server` to the `mock_vs_server`, ensuring they both
operate on the same endpoint.
To unify error handling, the low-level client methods now return
`std::expected` instead of throwing exceptions. This allows for
consistent and explicit error propagation from the client up to the
caller.
The relevant error types have been moved to a new `vector_search/error.hh`
header to centralize their definitions.
This refactoring extracts low-level client logic into a new, dedicated
`client` class. The new class is responsible for connecting to the
server and serializing requests.
This change prepares for extending the `vector_store_client` to check
node status via the `api/v1/status` endpoint.
`/ann` Response deserialization remains in the `vector_store_client` as it
is schema-dependent.
During sstable summary parsing, the entire header was read into a single
buffer upfront and then parsed to obtain the positions. If the header
was too large, it could trigger oversized allocation warnings.
This commit updates the parse method to read one position at a time from
the input stream instead of reading the entire header at once. Since
`random_access_reader` already maintains an internal buffer of 128 KB,
there is no need to pre read the entire header upfront.
Fixes#24428
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#26846
The view building coordinator sends tasks in form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.
However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.
To mitigate that, we rate limit the logging by 1 seconds.
We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. If not for rate limiting the warning messages,
we should start getting more of them, potentially leading to
a test failure.
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test: it fails before this commit and succeeds
with it.
Fixesscylladb/scylladb#26686
Currently we do not support paging for vector search queries.
When we get such a query with paging enabled we ignore the paging
and return the entire result. This behavior can be confusing for users,
as there is no warning about paging not working with vector search.
This patch fixes that by adding a warning to the result of ANN queries
with paging enabled.
Closesscylladb/scylladb#26384
add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.
the test reproduces issue #26340 where this results in a malformed
sstable.
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.
To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixesscylladb/scylladb#26340
This patch series re-enables support for speculative retry values `0` and `100`. These values have been supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879
](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` value were prevented too.
Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.
Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.
Fixes#26369
This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1.
Closesscylladb/scylladb#26909
* github.com:scylladb/scylladb:
test: cqlpy: add test case for non-numeric PERCENTILE value
schema: speculative_retry: update exception type for sstring ops
docs: cql: ddl.rst: update speculative-retry-options
test: cqlpy: add test for valid speculative_retry values
schema: speculative_retry: allow 0 and 100 PERCENTILE values
This method is specific to topology requests -- node joining, replacing,
decommissioning etc, everything that goes through
topology::transition_state::write_both_read_old and
raft_topology_cmd::command::stream_ranges. It shouldn't be used in
other contexts -- to handle global topology requests
(e.g. truncate table) or for tablets. Rename the method to make this
more explicit.
The method is specific to topology_coordinator, which already contains
a wrapper for it, so inline the topology method into it.
Also, make the logic of the method more explicit and remove multiple
transition_nodes lookups.
Adds test cases to verify that repair_reader correctly detects SSTable(both comprossed and uncompressed) checksum mismatch.
Digest mismatch verification is not possible as repair readar may skip some sstable data, which automatically disables digest verification.
Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.
Adds a compress flag to random_mutation_generator, allowing tests to disable compression in generated mutations.
When set to compress::no, the schema builder uses no_compression() parameters.
Add an _unlinked flag to track SSTable unlink state and check it in
read_digest() and read_checksum() methods to skip file reads for
unlinked SSTables, preventing potential file not found errors.
Add a test that reproduces the issue scylladb/scylladb#26976.
The test adds a new node with delayed group0 apply, and does writes with
MV updates right after the join completes on the coordinator and while
the joining node's state is behind.
The test fails before fixing the issue and passes after.
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixesscylladb/scylladb#26976
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:
```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```
To avoid that, we should wait until initialization is complete.
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk the data resurrection.
Update _last_replay only if all batches were replayed.
Fixes: https://github.com/scylladb/scylladb/issues/24415.
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.
The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.
We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.
When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.
The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.
Fixes https://github.com/scylladb/scylladb/issues/26405
backport not needed - enhancement
Closesscylladb/scylladb#24960
* github.com:scylladb/scylladb:
test: cdc: test cdc compatible schema
cdc: use compatiable cdc schema
db: schema_applier: create schema with pointer to CDC schema
db: schema_applier: extract cdc tables
schema: add pointer to CDC schema
schema_registry: remove base_info from global_schema_ptr
schema_registry: use extended_frozen_schema in schema load
schema_registry: replace frozen_schema+base_info with extended_frozen_schema
frozen_schema: extract info from schema_ptr in the constructor
frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.
The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.
Fixes#26584Closesscylladb/scylladb#26609
* github.com:scylladb/scylladb:
Add cluster tests for checking scoped primary_replica_only streaming
Improve choice distribution for primary replica
Refactor cluster/object_store/test_backup
nodetool restore: add primary-replica-only option
nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
Enable scoped primary replica only streaming
Support primary_replica_only for native restore API
Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
`topology_cooridinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search:
```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```
This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.
This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.
A version containing this bug has not been released yet, so no backport is needed.
Closesscylladb/scylladb#26946
* github.com:scylladb/scylladb:
load_stats: add test for migrate_tablet_size()
load_stats: fix problem with tablet size migration
cql3: Fix std::bad_cast when deserializing vectors of collections
This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`.
The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference.
This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct.
To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable.
Fixes: #26704
This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`.
Closesscylladb/scylladb#26740
* github.com:scylladb/scylladb:
cql3: Make abstract_type explicitly noncopyable
cql3: Fix std::bad_cast when deserializing vectors of collections
ignored_nodes is sufficient in these cases. excluded_tablet_nodes
also includes left_nodes_rs, which are not needed
here — global_token_metadata_barrier runs the barrier only
on normal and transition nodes, not on left nodes.
The topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas, this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, then this code wouldn't
look suspicious.
Cleaning up a node using per keyspace/table interface does not reset cleanup
needed flag in the topology. The assumption was that running cleanup on
already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if operator cleaned up the node manually
already.
Recently we enabled tablets by default, but it is necessary to
enable rf_rack_valid_keyspaces if materialized views are to be used
with tablets, and this option is *not* the default.
We did add this option in test/pylib/scylla_cluster.py which is
used by test.py, but we didn't add it to test/cqlpy/run.py, so
the test/cqlpy/run script is no longer able to run tests with
materialized views. So this patch adds the missing configuration
to run.py.
FIxes#26918
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26919
The polymorphic abstract_type class serves as an interface and should not be copied.
To prevent accidental and unsafe copies, make it explicitly uncopyable.
When deserializing a vector whose elements are collections (e.g., set, list),
the operation raises a `std::bad_cast` exception.
This was caused by type slicing due to an incorrect assignment of a
polymorphic type by value instead of by reference. This resulted in a
failed `dynamic_cast` even when the underlying type was correct.
Refs #26822
AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe.
Closesscylladb/scylladb#26934
* github.com:scylladb/scylladb:
encryption::kms_host: Add exponential backoff-retry for 503 errors
encryption::kms_host: Include http error code in kms_error
Refs #26822
AWS says to treat 503 errors, at least in the case of ec2 metadata
query, as backoff-retry (generally, we do _not_ retry on provider
level, but delegate this to higher levels). This patch adds special
treatment for 503:s (service unavailable) for both ec2 meta and
actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing
(doh!). Should try to set up a mock ec2 meta with injected errors
maybe.
v2:
* Use utils::exponential_backoff_retry
This patch fixes a bug with tablet size migration in load_stats.
has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to incorrect search iterator
comparison after a table and tablet saeach.
This change moves load_stats migrate_tablet_sizes() functionaility
into a separate method of load_stats.
Add --blocked-reactor-notify-ms argument to allow overriding the default
blocked reactor notification timeout value of 25 ms.
This change provides users the flexibility to customize the reactor
notification timeout as needed.
Fixes: scylladb/scylla-enterprise#5525Closesscylladb/scylladb#26892
In the test translated from Cassandra validation/operations/alter_test.py
we had two lines in the beginning of an unrelated test that verified
that CREATE KEYSPACE is not allowed without replication parameters.
But starting recently, ScyllaDB does have defaults and does allow these
CREATE KEYSPACE. So comment out these two test lines.
We didn't notice that this test started to fail, because it was already
marked xfail, because in the main part of this test, it reproduces a
different issue!
The annoying side-affect of these no-longer-passing checks was that
because the test expected a CREATE KEYSPACE to fail, it didn't bother
to delete this keyspace when it finished, which causes test.py to
report that there's a problem because some keyspaces still exist at the
end of the test. Now that we fixed this problem, we no longer need to
list this test in test/cqlpy/suite.yaml as a test that leaves behind
undeleted keyspaces.
Fixes#26292
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26341
The migration manager offers some free functions to prepare mutations
for a new/updated table/view. Most of them include a validation check
for the schema extensions, but in the following ones it's missing:
* `prepare_new_column_family_announcement` (overload with vector as out parameter)
* `prepare_new_column_families_announcement`
Presumably, this was just an omission. It's also not a very important
one since the only extension having validation logic is the
`encryption_schema_extension`, but none of these functions is connected
to user queries where encryption options can be provided in the schema.
User queries go through the other
`prepare_new_column_family_announcement` overload, which does perform a
validation check.
Add validation in the missing places.
Fixes#26470.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#26487
Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables.
We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table.
As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes.
Fixes https://github.com/scylladb/scylladb/issues/22463
Backport to 2025.4, it's important not to delay phasing out vnodes.
Closesscylladb/scylladb#26836
* github.com:scylladb/scylladb:
test,alternator: use 3-rack clusters in tests
alternator: improve error in tablets_mode_for_new_keyspaces=enforced
config: make tablets_mode_for_new_keyspaces live-updatable
alternator: improve comment about non-hidden system tags
alternator: Fix test_ttl_expiration_streams()
alternator: Fix test_scan_paging_missing_limit()
alternator: Don't require vnodes for TTL tests
alternator: Remove obsolete test from test_table.py
alternator: Fix tag name to request vnodes
alternator: Fix test name clash in test_tablets.py
alternator: test_tablets.py handles new policy reg. tablets
alternator: Update doc regarding tablets support
alternator: Support `tablets_mode_for_new_keyspaces` config flag
Fix incorrect hint for tablets_mode_for_new_keyspaces
Fix comment for tablets_mode_for_new_keyspaces
`test -v` isn't present on the MacOS shell. Since dbuild is intended
as a compatibility bridge between the host environment and the build
environment, don't use it there.
Use ${var+text_if_set} expansion as a workaround.
Fixes#26937Closesscylladb/scylladb#26939
Add a test to check that paged secondary index queries behave correctly
when pages are short. This is currently failing in Scylla, but passes in
Cassandra 5, therefore marked as "xfailing". Refer to the test's
docstring for more details.
The bug is a regression introduced by commit f6f18b1.
`test/cqlpy/run --release ...` shows that the test passes in 5.1 but
fails in 5.2 onwards.
Refs #25839.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#25843
This commits adds a tests checking various scenarios of restoring
via load and stream with primary_replica_only and a scope specified.
The tests check that in a few topologies, a mutation is replicated
a correct amount of times given primary_replica_only and that
streaming happens according to the scope rule passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.
This patch sorts the replica set before passing it through the
scope filter.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This PR splits the suppport code from test_backup.py
into multiple functions so less duplicated code is
produced by new tests using it. It also makes it a bit
easier to understand.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Add --primary-replica-only and update docs page for
nodetool restore.
The relationship with the scope parameter is:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=node is not allowed
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
So far it was not allowed to pass a scope when using
the primary_replica_only option, this patch enables
it because the concepts are now combined so that:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=node is not allowed
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This patch removes the restriction for streaming
to primary replica only within a scope.
Node scope streaming to primary replica is dissallowed.
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Current native restore does not support primary_replica_only, it is
hard-coded disabled and this may lead to data amplification issues.
This patch extends the restore REST API to accept a
primary_replica_only parameter and propagates it to
sstables_loader so it gets correctly passed to
load_and_stream.
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:
* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.
For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.
The other uses of `auth_integration` within the service level controller
are not problematic:
* `find_effective_service_level`,
* `find_cached_effective_service_level`.
They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.
Fixesscylladb/scylladb#26816
Refactoring scylla-ci to be triggered directly from each PR using GitHub action. This will allow us to skip triggering CI when PR commit message was updated (which will save us un-needed CI runs) Also we can remove `Scylla-CI-route` pipeline which route each PR to the proper CI job under the release (GitHub action will do it automatically), to reduce complexity
Fixes: https://scylladb.atlassian.net/browse/PKG-69Closesscylladb/scylladb#26799
* Adds test fixture for AWS KMS
* Adds test fixture for Azure KMS
* Adds key provider proxy for Azure to pytests (ported dtests)
* Make test gather for boost tests handle suites
* Fix GCP test snafu
Fixes#26781Fixes#26780Fixes#26776Fixes#26775Closesscylladb/scylladb#26785
* github.com:scylladb/scylladb:
gcp_object_storage_test: Re-enable parallelism.
test::pylib: Add azure (mock) testing to EAR matrix
test::boost::encryption_at_rest: Remove redundant azure test indent
test::boost::encryption_at_rest: Move azure tests to use fixture
test::lib: Add azure mock/real server fixture
test::pylib::boost: Fix test gather to handle test suites
utils::gcp::object_storage: Fix typo in semaphore init
test::boost::encryption_at_rest_test: Remove redundant indent
test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test
test::boost::test_encryption_at_rest: Reorder tests and helpers
ent::encryption: Make text helper routines take std::string
test::pylib::dockerized_service: Handle docker/podman bind error message
test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS
test::lib::gcs_fixture: Only set port if running docker image + more retry
worker during intra-node migration
Use methods introduces in previous commit and:
- load staging sstables to the view building worker on the target
shard, at the end of `streaming` stage
- clear migrated staging sstables on source shard in `cleanup` stage
This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`.
Fixesscylladb/scylladb#26244
There was a problem with staging sstables after tablet merge.
Let's say there were 2 tablets and tablet 1 (lower last token)
had an staging sstable. Then a tablet merge occured, so there is only
one tablet now (higher last token).
But entries in `_staging_sstables`, which are grouped by last token, are
never adjusted.
Since there shouldn't be thousands of sstables, we can just hold list of
sstables per table and filter necessary entries when doing
`process_staging` view building task.
docs/alternator/compatibility.md describes support for global (multi-DC)
tables, and suggests that the CQL command "ALTER TABLE" should be used
to change the replication of an Alternator table. But actually, the
right command is "ALTER KEYSPACE", not "ALTER TABLE". So fix the
document.
Fixes#26737Closesscylladb/scylladb#26872
Add `AuditBackendComposite`, a test class which allows testing multiple
audit outputs in a single run, implemented in `audit_composite_storage_helper`
class.
Add two more tests.
`test_composite_audit_type_invalid` tests if an invalid audit mode among
correct ones causes the same error as when it is the only specified audit mode.
`test_composite_audit_empty_settings` tests if `'none'` audit mode, when
specified along other audit modes, properly disables audit logging.
Refs #26022
Before this patch audit test could process audit logs from a single
audit output. This patch adds support for multiple audit outputs
in the same run. The change is needed in order to test
`audit_composite_storage_helper`, which can write to multiple
audit outputs.
Refs #26022
This patch adds support for multiple audit log outputs.
If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the
`audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.
Fixes#26022
Add test case for non-numeric PERCENTILE value, which raises an error
different to the out-of-range invalid values. Regex in the test
test_invalid_percentile_speculative_retry_values is expanded.
Refs #26369
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304
Refs #26369
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)
Refs #26369
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.
Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.
Refs #26369
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.
On top of that, speculative_retry::to_sstring function did rounding when
formatting the string, which introduced inconsistency.
Fixes#26369
With tablets enabled, we can't create an Alternator table on a three-
node cluster with a single rack, since Scylla refuses RF=3 with just
one rack and we get the error:
An error occurred (InternalServerError) when calling the CreateTable
operation: ... Replication factor 3 exceeds the number of racks (1) in
dc datacenter1
So in test/cluster/test_alternator.py we need to use the incantation
"auto_rack_dc='dc1'" every time that we create a three-node cluster.
Before this patch, several tests in test/cluster/test_alternator.py
failed on this error, with this patch all of them pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is
supposed to fail when CreateTable asks explicitly for vnodes. Before
this patch, this error was an ugly "Internal Server Error" (an
exception thrown from deep inside the implementation), this patch
checks for this case in the right place, to generate a proper
ValidationException with a proper error message.
We also enable the test test_tablets_tag_vs_config which should have
caught this error, but didn't because it was marked xfail because
tablets_mode_for_new_keyspaces had not been live-updatable. Now that
it is, we can enable the test. I also improved the test to be slightly
faster (no need to change the configuration so many times) and also
check the ordinary case - where the schema doesn't choose neither
vnodes nor tablets explicitly and we should just use the default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.
For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.
In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patches added a somewhat misleading comment in front of
system:initial_tablets, which this patch improves.
That tag is NOT where Alternator "stores" table properties like the
existing comment claimed. In fact, the whole point is that it's the
opposite - Alternator never writes to this tag - it's a user-writable
tag which Alternator *reads*, to configure the new table. And this is
why it obviously can't be hidden from the user.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With tablets, the test begun failing. The failure was correlated with
the number of initial tablets, which when kept at default, equals
4 tablets per shard in release build and 2 tablets per shard in dev
build.
In this patch we split the test into two - one with a more data in
the table to check the original purpose of this test - that Scan
doesn't return the entire table in one page if "Limit" is missing.
The other test reproduces issue #10327 - that when the table is
small, Scan's page size isn't strictly limited to 1MB as it is in
DynamoDB.
Experimentally, 8000 KB of data (compared to 6000 KB before this patch)
is enough when we have up to 4 initial tablets per shard (so 8 initial
tablets on a two-shard node as we typically run in tests).
Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com>
modified by Nadav Har'El <nyh@scylladb.com>
Since #23662 Alternator supports TTL with tablets too. Let's clear some
leftovers causing Alternator to test TTL with vnodes instead of with
what is default for Alternator (tablets or vnodes).
Since Alternator is capable of runnng with tablets according to the
flag in config, remove the obsolete test that is making sure
that Alternator runs with vnodes.
The tag was lately renamed from `experimental:initial_tablets` to
`system::initial_tablets`. This commit fixes both the tests as well as
the exceptions sent to the user instructing how to create table with
vnodes.
Reflect honouring by Alternator the value of the config flag
`tablets_mode_for_new_keyspaces`, as well as renaming of the tag
`experimental:initial_tablets` into `system:initial_tablets`.
Until now, tablets in Alternator were experimental feature enabled only
when a TAG "experimental:initial_tablets" was present when creating a
table and associated with a numeric value.
After this patch, Alternator honours the value of
`tablets_mode_for_new_keyspaces` config flag.
Each table can be overriden to use tablets or not by supplying a new TAG
"system:initial_tablets". The rules stay the same as with the earlier,
experimental tag: when supplied with a numeric value, the table will use
tablets (as long as they are supported). When supplied with something
else (like a string "none"), the table will use vnodes, provided that
tablets are not `enforced` by the config flag.
Fixes#22463
Those test cases use lister::scan_dir() to validate the contents of snapshot directory of a table against this table's base directory. This PR generalizes the listing code making it shorter.
Also, the snapshot_skip_flush_works case is missing the check for "schema.cql" file. Nothing is wrong with it, but the test is more accurate if checking it.
Also, the snapshot_with_quarantine_works case tries to check if one set of names is sub-set of another using lengthy code. Using std::includes improves the test readability a lot.
Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586)
Improving existing working test, no backport is needed.
Closesscylladb/scylladb#26693
* github.com:scylladb/scylladb:
database_test: Simplify snapshot_with_quarantine_works() test
database_test: Improve snapshot_skip_flush_works test
database_test: Simplify snapshot_works() tests
database_test: Use collect_files() to remove files
database_test: Use collectz_files() to count files in directory
database_test: Introduce collect_files() helper
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
Fixesscylladb/scylladb#26819Closesscylladb/scylladb#26839
In the present scenario, there are issues in left_token_ring transition state
execution in the decommissioning path. In case of concurrent mutation race
conditions, we enter left_token_ring more than once, and apparently if
we enter left token ring second time, we try to barrier the decommisioned
node, which at this point is no longer possible. That's what causes the errors.
This pr resolves the issue by adding a check right in the start of
left_token_ring to check if the first topology state update, which marks
the request as done is completed. In this case, its confirmed that this
is the second time flow is entering left_token_ring and the steps preceding
the request status update should be skipped. In such cases, all the rest
steps are skipped and topology node status update( which threw error in
previous trial) is executed directly. Node removal status from group0 is
also checked and remove operation is retried if failed last time.
Although these changes are done with regard to the decommission operation
behavior in `left_token_ring` transition state, but since the pr doesn't
interfere with the core logic, it should not derail any rollback specific
logic. The changes just prevent some non-idempotent operations from
re-occuring in case of failures. Rest of the core logic remain intact.
Test is also added to confirm the proper working of the same.
Fixes: scylladb/scylladb#20865
Backport is not needed, since this is not a super critical bug fix.
Closesscylladb/scylladb#26717
Fixes: #26440
1. Added description to primary-replica-only option
2. Fixed code text to better reflect the constrained cheked in the code
itself. namely: that both primary replica only and scope must be
applied only if load and steam is applied too, and that they are mutual
exclusive to each other.
Note: when https://github.com/scylladb/scylladb/issues/26584 is
implemented (with #26609) there will be a need to align the docs as
well - namely, primary-replica-only and scope will no longer be
mutual exclusive
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
Closesscylladb/scylladb#26480
The variable was unused since cae999c094 ("toolchain: change
optimized clang install method to standard one"), and now causes
the differential shellcheck continuous integration test to fail whenever
it is changed. Remove it.
Closesscylladb/scylladb#26796
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for `abstract_write_response_handler` instances destruction. We strived to minimize the memory footprint of `abstract_write_response_handler`, with `write_handler_destroy_promise`-es we required only a single additional int. It turned our that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector can become big and cause 'oversized allocation' seastar warnings.
Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).
In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. Another one can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers to be called at the same time, but it should be rare.
The `sizeof(utils::small_vector) == 40`, this is `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable.
Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)
backport: need to backport to 2025.4 (LWT for tablets release)
Closesscylladb/scylladb#26827
* https://github.com/scylladb/scylladb:
storage_proxy: use coroutine::maybe_yield();
storage_proxy: use gates to track write handlers destruction
* tools/cqlsh f852b1f5...19445a5c (2):
> Update scylla-driver version to 3.29.4
Update tools/cqlsh submodule for scylla-driver 3.29.4
The motivation for this update is to resolve a driver-side serialization bug that was blocking work on #26740. The bug affected vector<collection> types (e.g., vector<set<int>,1>) and is fixed in scylla-driver versions 3.29.2+.
Refs #26704
It is useful to check time spent on tablet repair. It can be used to
compare incremental repair and non-incremental repair. The time does not
include the time waiting for the tablet scheduler to schedule the tablet
repair task.
Fixes#26505Closesscylladb/scylladb#26502
Extract storage helper creation into `create_storage_helper` function.
Call this function from `audit::audit`. It will be called per shard inside
`sharded<audit>::start` method.
Refs #26022
There is no need to have `create_audit` separate from `start_audit`.
`create_audit` just stores the passed parameters, while `start_audit`
does the actual initialization and startup work.
Refs #26022
Re-enable parallel execution to get better logs.
Note, this is somewhat wasteful, as we won't re-use test fixture here,
but in the end, it is probably an improvement.
When we build a materialized view we read the entire base table from start to
end to generate all required view udpates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view udpates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")
If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`
In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.
Fixes https://github.com/scylladb/scylladb/issues/26523Closesscylladb/scylladb#26635
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler, with write_handler_destroy_promise-es
we required only a single additional int. It turned our that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.
Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.
In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers to be called at the same time, but it should be
rare.
The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.
Fixesscylladb/scylladb#26788
In pull request #26384 a discussion started whether page_size=0 really
disables paging, or maybe one needs page_size=-1 to truly disable paging.
The reason for that discussion was commit 08c81427b that started to
use page_size=-1 for internal unpaged queries, and commit 76b31a3 that
incorrectly claimed that page_size>=0 means paging is enabled.
This patch introduces a test that confirms that with page_size=0, paging
is truly disabled - including the size-based (1MB) paging.
The new test is Scylla-only, because Cassandra is anyway missing the
size-based page cutoff (see CASSANDRA-11745).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26742
Introduced in 9ebdeb2
The problem is specific to node replacing and rack-list RF. The
culprit is in the part of the load balancer which determines rack's
shard count. If we're replacing the last node, the rack will contain
no normal nodes, and shards_per_rack will have no entry for the rack,
on which the table still has replicas. This throws std::out_of_range
and fails the tablet draining stage, and node replace is failed.
No backport because the problem exists only on master.
Fixes#26768Closesscylladb/scylladb#26783
Python 3.14 changed the multiprocessing fork mode to "forkserver",
presumably for good reasons. However, it conflicts with our
relocatable Python system. "forkserver" forks and execs a Python
process at startup, but it does this without supplying our relocated
ld.so. The system ld.so detects a conflict and crashes.
Fix this by switching back to "fork", which is sufficient for
housekeeping's modest needs.
Closesscylladb/scylladb#26831
The test collects Data files from table dir, then _all_ files from
snapshot dir and then checks whether the former is the subset of the
latter. Using std::includes over two sets makes the code much shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It has two inaccuracies.
First, when checking the contents of table directory, it uses
pre-populated expected list with "manifest.json" in it. Weird.
Second, when cechking the contents of snapshot directory it doesn't
check if the "schema.cql" is there. It's always there, but if something
breaks in the future it may come unnoticed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes here, just make use of the new lister to shorten
the code. A small side effect -- if the test fails because contents of
directories changes, it will print the exact difference in logs, not
just that N files are missing/present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some test cases remove files from table directory to perform some checks
over the taken snapshots. Using collect_files() helper makes the code
easier to read.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some test cases want to see that there are more than one file in a
directory, so they can just re-use the new helper. Much shorter this
way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It returns a set of files in a given directoy. Will be used by all next
patches.
Implemented using directory_lister, not lister::scan_dir in order to
help removing the latter one in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#26781
Makes the test independent of wrapping scripts. Note: retains the
split into "real" and "mock" tests. For other tests, we either all
mock, or allow the environment to select mock or real. Here we have
them combined. More expensive, but otoh more thourough.
Wraps the real/mock azure server for test in a fixture.
Note: retains the current test setup which explicitly runs
some tests with "real" azure, if avail, and some always mock.
Runs local-kms mock AWS KMS server unless overridden by env var.
Allows tests to use real or fake AWS KMS endpoint and shared fixture
for quicker execution.
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
rollback the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do it and we look at
the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us rollback, but we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.
Fixes https://github.com/scylladb/scylladb/issues/26691Closesscylladb/scylladb#26825
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of synchronous and asynchronous cleanup, major
compaction, and upgrade_sstables.
Fixes: https://github.com/scylladb/scylladb/issues/26715.
Requires backports to all live versions
Closesscylladb/scylladb#26746
* github.com:scylladb/scylladb:
api: storage_service: tasks: unify upgrade_sstable
api: storage_service: tasks: force_keyspace_cleanup
api: storage_service: tasks: unify force_keyspace_compaction
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": ְְְAfter the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys.
This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.
The first patch is the implementation of the the feature - the new configuration option, the metrics and the log messages, the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternaor_enforce_authorization` from false to true.
Fixes#25308
This is a feature that users need, so it should probably be backported to live branches.
Closesscylladb/scylladb#25457
* github.com:scylladb/scylladb:
docs/alternator: explain alternator_warn_authorization
test/alternator: tests for new auth failure metrics and log messages
alternator: add alternator_warn_authorization config
There's a test that checks if temporary-statistics file is gone at some
point. It does it by listing the directory it expects the file to be in
and then comparing the names met with the temp. stat. file name.
It looks like a single file_exists() call is enough for that purpose.
As a "sanity" check this patch adds a validation that non-temporary
statistics file is there, all the more so this file is removed after the
test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26743
Support the counters feature in tablets keyspaces.
The main change is to fix the counter update during tablets intranode migration.
Counter cell is c = map<host_id, value>. A counter update is applied by doing read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then apply the mutation and replicate it to other hosts. the read-modify-write is protected against concurrent updates by locking the counter cell.
When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the stage write_both_read_new the read shard is switched, and then we can have concurrent updates reach either the old or the new shard. In order to keep the counter update exclusive we lock both shards when in the stage write_both_read_new.
Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy.
The change applies to both tablets and vnodes, they use the same implementation, but for vnodes the behavior should remain equivalent up to some small reordering of the code since it doesn't have intranode migration and reduces to single read shard = write shard.
Fixes https://github.com/scylladb/scylladb/issues/18180
no backport - new feature
Closesscylladb/scylladb#26636
* github.com:scylladb/scylladb:
docs: counters now work with tablets
pgo: enable counters with tablets
test: enable counters tests with tablets
test: add counters with tablets test
cql3: remove warning when creating keyspace with tablets
cql3: allow counters with tablets
storage_proxy: lock all read shards for counter update
storage_proxy: apply counter mutation on all write shards
storage_proxy: move counter update coordination to storage proxy
storage_proxy: refactor mutate_counter_on_leader
replica/db: add counter update guard
replica/db: split counter update helper functions
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled()
but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.
The reason the reproducer didn't catch it is that it was restricted
to single cpu. It will now run with multi cpu and catch the
problem observed.
Fixes#22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#26730
Counters are now supported in tablet-enabled keyspaces, so remove
the documentation that listed counters as an unsupported feature
and the note warning users about the limitation.
Enable all counters-related tests that were disabled for tablets because
counters was not supported with tablets until now.
Some tests were parametrized to run with both vnodes and tablets, and
the tablets case was skipped, in order to not lose coverage. We change
them to run with the default configuration since now counters is
supported with both vnodes and tablets, and the implementation is the
same, so there is no benefit in running them with both configurations.
add a new test for counters with tablets to test things that are
specific to tablets. test counter updates that are concurrent with
tablet internode and intranode migrations and verify it remains
consistent and no updates are lost.
When creating a keyspace with tablets, a warning is shown with all the
unsupported features for tablets, which is only counters currently.
Now that counters is also supported with tablets, we can remove this
warning entirely.
Now that counters work with tablets, allow to create a table with
counters in a tablets-enabled keyspace, and remove the warning about
counters not being supported when creating a keyspace with tablets.
We allow to use counters with tablets only when all nodes are upgraded
and support counters with tablets. We add a new feature flag to
determine if this is the case.
Fixesscylladb/scylladb#18180
Previously in a counter update we lock the read shard to protect the
counter's read-modify-write against concurrent updates.
This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.
To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.
This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.
Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica
In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.
After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
Slightly reorganize the mutate counter function to prepare it for a
later change.
Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
Add a RAII guard for counter update that holds the counter locks and the
table operation, and extract the creation of the guard to a separate
function.
This prepares it for a later change where we will want to obtain the
guard externally from the storage proxy.
parse() taking a list of elements is quadratic (during compile time) in
that it generates recursive calls to itself, each time with one fewer
parameter. The total size of the parameter lists in all these generated
functions is quadratic in the initial parameter list size.
It's also exponential if we ignore inlining limits, since each .then()
call expands to two branches - a ready future branch and a non-ready
future branch. If the compiler did not give up, we'd have 2^list_len
branches. For sure the compiler does not do so indefinitely, but the effort
getting there is wasted.
Simplify by using a fold expression over the comma operator. Instead
of passing the remaining parameter list in each step, we pass only
the parameter we are processing now, making processing linear, and not
generating unnecessary functions.
It would be better expressed using pack expansion statements, but these
are part of C++26.
The largest offender is probably stats_metadata, with 21 elements.
dev-mode sstables.o:
text data bss dec hex filename
1760059 1312 7673 1769044 1afe54 sstables.o.before
1745533 1312 7673 1754518 1ac596 sstables.o.after
We save about 15k of text with presumably a corresponding (small)
decrease in compile time.
Closesscylladb/scylladb#26735
This check is incorrect: the current shard may be looking at
the old version of tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0
Fixesscylladb/scylladb#26801Closesscylladb/scylladb#26809
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).
Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.
This patch introduces this kind of API:
```
nodetool excludenode <host-id> [ ... <host-id> ]
```
Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.
Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.
Fixes#21281
The PR also changes "nodetool status" to show excluded nodes,
they have 'X' in their status instead of 'D'.
Closesscylladb/scylladb#26659
* github.com:scylladb/scylladb:
nodetool: status: Show excluded nodes as having status 'X'
test: py: Test scenario involving excludenode API
nodetool: Introduce excludenode command
Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and them move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes few places in Scylla that still use one.
Adopting newer seastar API, no need to backport
Closesscylladb/scylladb#26747
* github.com:scylladb/scylladb:
commitlog: Remove unused work::r stream variable
ec2_snitch: Fix indentation after previous patch
ec2_snitch: Coroutinize the aws_api_call_once()
sstable: Construct output_stream for data instantly
test: Don't reuse on-stack input stream
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).
Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using with a listener callback.
Fixes#26610.
Closesscylladb/scylladb#26697
* github.com:scylladb/scylladb:
test/cluster: Add test for default SSTable compressor
db/config: Change default SSTable compressor to LZ4WithDictsCompressor
db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
boost/cql_query_test: Get expected compressor from config
The compaction manager backlog is exposed via metrics, but if static
shares are set, the backlog is never calculated. As a result, there is
no way to determine the backlog and if the static shares need
adjustment. Fix that by calculating backlog even when static shares are
set.
Fixes#26287
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#26778
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.
a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.
Fixesscylladb/scylladb#26791Closesscylladb/scylladb#26792
This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations.
This is the second part of the size based load balancing changes:
- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254
This is a new feature and backport is not needed.
Closesscylladb/scylladb#26152
* github.com:scylladb/scylladb:
load_balancer: load_stats reconcile after tablet migration and table resize
load_stats: change data structure which contains tablet sizes
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).
Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.
This patch introduces this kind of API:
nodetool excludenode <host-id> [ ... <host-id> ]
Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.
Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.
Fixes#21281
We want to move towards rack-list based replication factor for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to rack list on keyspace creation and ALTER when rf_rack_valid_keyspaces option is enabled.
The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs.
Fixes#26397Closesscylladb/scylladb#26692
* github.com:scylladb/scylladb:
cql3: ks_prop_defs: Expand numeric RF to rack list
locator: Move rack_list to topology.hh
alternator: Do not set RF for zero-token DCs
alternator: Switch keyspace creation to use ks_prop_defs
test: alternator: Adjust for rack lists
cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs
test: cqlpy: Mark tests using rack lists as scylla-only
test: Switch to rack-list based RF
test: Generalize tests to work with both numeric RF and rack lists
test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF
test: Prepare for handling errors specific to rack list path
test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention
test: Create cluster with multiple racks in multi-dc setups
test: boost: network_topology_strategy_test: Adjust to rack-list RF
test: tablets: Adjust to rack list
test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness
test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid
test: object_store: test_backup: Adjust for rack lists
test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity
test: cluster: mv: Do not move tablets across racks
test: cluster: util: Fix docstring for parse_replication_options()
tablets, topology_coordinator: Skip tablet draining on replace
When auto compaction is disabled, all ongoing compactions, including
major compactions, are stopped. However, major compactions should not be
stopped, since the disable request applies only to regular auto
compactions.
This PR fixes the issue by tagging major compaction tasks with a newly
introduced `compaction_type::Major` enum. Since
`table::disable_auto_compaction()` already requests the compaction
manager to stop only tasks of type `compaction_type::Compaction`, major
compactions will no longer be stopped.
Fixes#24501
PR improves how the compactions are stopped when a disable auto compaction request is executed.
No need to backport
Closesscylladb/scylladb#26288
* github.com:scylladb/scylladb:
replica/table: do not stop major compaction when disabling auto compaction
compaction/compaction_descriptor: introduce compaction_type::Major
The previous patch made the default compressor dependent on the
SSTABLE_COMPRESSION_DICTS feature:
* LZ4Compressor if the feature is disabled
* LZ4WithDictsCompressor if the feature is enabled
Add a test to verify that the cluster uses the right default in every
case.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).
Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using with a listener callback.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
We adjust the test to RF-rack-validity and then re-enable
index random events, which requires the configuration option
`rf_rack_valid_keyspaces` to be enabled.
Fixesscylladb/scylladb#26422
Backport: I'd rather not backport these changes. They're almost a hack and poses too much risk for little gain.
Closesscylladb/scylladb#26591
* github.com:scylladb/scylladb:
test/cluster/random_failures: Re-enable index events
test/cluster/random_failures: Enable rf_rack_valid_keyspaces
test/cluster/random_failures: Adjust to RF-rack-validity
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace}
and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of /storage_service/keyspace_cleanup/{keyspace}
and /tasks/compaction/keyspace_cleanup/{keyspace}.
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace},
to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}).
Unify the handlers of both apis.
The tests pass only with alternator_streams_strict_compatibility flag
enabled, because of a suspected non-negligible performance impact (i.e.
an additional entire-item comparison and type conversions).
Refs https://github.com/scylladb/scylladb/issues/6918
Deletes that don't change the state of the database visible to the user
(e.g. an attempt to delete a missing item) shouldn't produce a cdc log.
This commit addresses this DynamoDB compatibility issue if the delete is
a partition delete, a row delete, or a cell delete. It works under the
assumption that the change was produced by Alternator. This means that
it doesn't support range deletes, static row deletes, deletes of
collection cells other than a map, etc. See also its parent commit,
which introduces the methods that this commit extends.
This commit handles the following cases:
- `DeleteItem of nonexistent item: nothing`,
- `BatchWriteItem.DeleteItem of nonexistent item: nothing`.
Refs https://github.com/scylladb/scylladb/pull/26121
This commit adds a function that compares split mutations with the
`row_state`, that was selected as a preimage or propagated through
cdc options by a caller. If the items are equal, the corresponding log
row isn't generated. The result being that creating an item with
BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY
event if exactly identical item already exists.
Comparing the items may be costly, so this logic is controlled by
`alternator_streams_compabitiblity` flag.
This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and equal
item: nothing`
This commit improves compatibility with DynamoDB streams by changing the
emitted events when creating/updating an item. Replace/update operations
of an existing item emit a MODIFY, whereas replacing/updating a missing
item results in an INSERT. If the state of the item doesn't change after
applying the operation, no event is emitted.
This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and not equal item: MODIFY`
- `PutItem/UpdateItem/BatchWriteItem.PutItem of a nonexistent item: INSERT`
Refs https://github.com/scylladb/scylladb/issues/6918
Change the type from future<executor::request_return_type> to
executor::request_return_type, because the method isn't async and one
out of two callers unwraps the future immediately. This simplifies the
code a little and probably saves a few instructions, since we suspect
that moving a future<X> is more expensive than just moving X.
With this flag enabled, Alternator Streams produces more accurate event
types:
- nop operations (i.e. replacing an item with an identical one, deleting
a nonexistent item) don't produce an event,
- updates of an existing item produce a MODIFY event, instead of INSERT,
- etc.
This flag affects the internal behaviour of some operations, i.e.
Alternator may select a preimage and propagate it to CDC (in contrary to
CDC making the request), or do extra item comparisons (i.e. compare the
existing item with the new one). These operations may be costly, and
users that don't use Streams won't need them.
This flag is live-updatable. An operation reads this flag once, and uses
its value for the entire operation.
CDC log table records a mutation as a sequence of log rows that record
an atomic change (i.e. a row marker, tombstones, etc.), whereas a
mutation in Alternator Streams always appears as a single log row. The
type of operation is determined based on the type of the last log row in
CDC.
As a result, updates that create a row always appeared to Alternator
Streams as an update (row marker + data), rather than an insert. This
commit makes them a single log row. Its operation type is insert if it
contains a row marker, and an update otherwise, which gives results
consistent with DynamoDB Streams.
Auto-exands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.
Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
So that we get the same validation and option post-processing as
during regular keyspace creation.
RF auto-expansion logic happens in ks_prop_defs, and we want that
for tablets.
To achieve RF=3 with tablets and rf_rack_valid_keyspaces, we need 3
racks. So change the test to create 3 racks. Alternator was bypassing
standard keyspace creation path, so it escaped validation. But this
will change, and the test will stop wroking.
Also, after auto-expansion of RF to rack list, not all of 4 nodes
will host replicas. So need to adjust expectations.
Tests expect this failure in some scenarios, but later changes make us
fail ealier due to topology constraints.
As a rule, more general validation should come before more specific
validation. So syntax validation before topology validation.
Have to do that before we enable auto-expansion of numeric RF to
rack-lists, because those tests alter the replication factor, and
altering from rack-list to numeric will not be allowed.
Two changes here:
1) Allocate nodes in dc2 in separeate racks to make the test stronger
- it invites bugs where RF==nr_racks succeeds despite there being
zero-token nodes, and not simply fail due to rack count.
2) Due to auto-expansion to rack list, scylla throws in keyspace
creation rather than table creation.
With rf_rack_valid_keyspaces enabled, RF of alternator tables will be
equal to the number of racks (in this test: nodes). Prior to that, if
number of nodes is smaller than 3, alternator creates the keyspace
with RF=1. Turns out, with RF=2 the test fails with write timeouts due
to contention. Enforce RF=1 by creating the table with one node before
adding the second node.
test_decommission_rack_load_failure expects some tablets to land in
the rack which only has the decommissioning node. Since the table uses
RF=1, auto-expansion may choose the other rack and put all tablets
there, and the expected failure will not happen. Force placement by
using rack-list RF.
Choose old_replica and new_replica so that they're both in rack r1.
After later changes (rack list auto expansion), it's no longer
guaranteed that the first replica will be on r1.
Replace doesn't drain (rebuild) tablets during topology change. They
are rebuilt afterwards when the replaced node is in "left" state and
replacing node is in normal state. So there is no point in attempting
to drain, as nothing will be drained.
Not only that, doing so has a risk, because the load balancer is
invoked on a transitional topology state in which we can end up with
no normal nodes in a rack. That's the case if the replaced node was
the last one in the rack. This tripped one of the algorithms which
computes rack's shard count for the purpose of determining ideal
tablet count, it was not prepared to find an empty rack to which a
table is still repliacated. That was fixed separately, but to avoid
this, we better skip tablet draining here.
This patch introduces a new overload of 'sstable::data_stream()' that allows
callers to provide their own 'file_input_stream_options'.
This change will be useful in the next commit to enable integrity checking
for file streaming.
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435
As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.
Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.
For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In commit a3ec6c7d1d we supposedly
implemented the feature of telling TTL experation events from regular
user-sent deletions. However, that implementation did not actually work
at all... It had two bugs:
1. It created an null rjson::value() instead of an empty dictionary
rjson::empty_object(), so GetRecords failed every time such a
TTL expiration event was generated.
2. In the output, it used lowercase field names "type" and "principalId"
instead of the uppercase "Type" and "PrincipalId". This is not the
correct capitalization, and when boto3 recieves such incorrect
fields it silently deletes them and never passes them to the user's
get_records() call.
This patch fixes those two bugs, and importantly - enables a test for
this feature. We did already have such a test but it was marked as
"veryslow" so doesn't run in CI and apparently not even run once to
check the new feature. This test is not actually very long on Alternator
when the TTL period is set very low (as we do in our tests), so I replaced
the "veryslow" marker by "waits_for_expiration". The latter marker means
that the test is still very slow - as much as half an hour - on DynamoDB -
but runs quickly on Scylla in our test setup, and enabled in CI by
default.
The enabled test failed badly before this patch (a server error during
GetRecords), and passes with this patch.
Also, the aforementioned commit forgot to remove the paragraph in
Alternator's compatibility.md that claims we don't have that feature yet.
So we do it now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26633
Until this patch, CDC haven't fetched a preimage for mutations
containing only a partition tombstone. Therefore, single-row deletions
in a table witout a clustering key didn't include a preimage, which was
inconsistent with single-row clustered deletions. This commit addresses
this inconsistency.
Second reason is compatibility with DynamoDB Streams, which doesn't
support entire-partition deletes. Alternator uses partition tombstones
for single-row deletions, though, and in these cases the 'OldImage' was
missing from REMOVE records.
Fixes https://github.com/scylladb/scylladb/issues/26382Closesscylladb/scylladb#26578
cql3: Refactor vector search select impl into a dedicated class
The motivation for this change is crash fixed in https://github.com/scylladb/scylladb/pull/25500.
This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization.
Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes.
To address this, the refactoring introduces the following changes:
- A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk.
- The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes.
- An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future.
Fixes: VECTOR-162
No backport is needed, as this is refactoring.
Closesscylladb/scylladb#25798
* github.com:scylladb/scylladb:
cql3: Rename indexed_table_select_statement
cql3: Move vector search select to dedicated class
When auto compaction is disabled, all ongoing compactions, including
major compactions, are stopped. However, major compactions should not be
stopped, since the disable request applies only to regular auto
compactions.
This patch fixes the issue by tagging major compaction tasks with the
newly introduced `compaction_type::MajorCompaction`. Since
`table::disable_auto_compaction()` already requests the compaction
manager to stop only tasks of type `compaction_type::Compaction`, major
compactions will no longer be stopped.
Fixes#24501
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Introduce a new compaction_type enum : `Major`.
This type will be used by the next patches to differentiate between
major compaction and regular compaction (compaction_type::Compaction).
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Since 5b6570be52, the default SSTable compression algorithm for user
tables is no longer hardcoded; it can be configured via the
`sstable_compression_user_table_options.sstable_compression` option in
scylla.yaml.
Modify the `test_table_compression` test to get the expected value from
the configuration.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixes https://github.com/scylladb/scylladb/issues/26732
backport to 2025.4 where cdc with tablets is introduced
Closesscylladb/scylladb#26160
* github.com:scylladb/scylladb:
test: cdc: extend cdc with tablets tests
cdc: improve cdc metadata loading
We currently allow creating multiple vector indexes on one column.
This doesn't make much sense as we do not support picking one when
making ann queries.
To make this less confusing and to make our behavior similar
to Cassandra we disallow the creation of multiple vector indexes
on one column.
We also add a test that checks this behavior.
Fixes: VECTOR-254
Fixes: #26672Closesscylladb/scylladb#26508
This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR.
backport: not needed, this PR contains only renames and code comment fixes
Closesscylladb/scylladb#26745
* https://github.com/scylladb/scylladb:
test_automatic_cleanup: fix comment
storage_proxy: remove stale comment
storage_proxy: improve run_fenceable_write comment
topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes
storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed
storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
The previous patches added the ability to set
alternator_warn_authorization. In this patch we add to our
documentation a recommendation that this setting be used as an
intermediate step when wanting to change alternator_enforce_authorization
from "false" to "true". We explain why this is useful and important.
The new documentation is in docs/alternator/compatibility.md, where
we previously explained the alternator_enforce_authorization configuration.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to test_metrics.py tests that authentication and
authorization errors increment, respectively, the new metrics
scylla_alternator_authentication_failures
scylla_alternator_authorization_failures
This patch also adds in test_logs.py tests that verify that that log
messages are generated on different types of authentication/authorization
failures.
The tests also check how configuring alternator_enforce_authorization
and alternator_warn_authorization changes these behaviors:
* alternator_enforce_authorization determines whether an auth error
will cause the request to fail, or the failure is counted but then
ignored.
* alternator_warn_authorization determines whether an auth error will
cause a WARN-level log message to be generated (and also the failure
is counted.
* If both configuration flags are false, Alternator doesn't even
attempt to check authentication or authorization - so errors aren't
even counted.
Because the new tests live-update the alternator_*_authorization
configuration options, they also serve as a test that live-updating
this option works correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, the configuration alternator_enforce_authorization
is a boolean: true means enforce authentication checks (i.e., each
request is signed by a valid user) and authorization checks (the user
who signed the request is allowed by RBAC to perform this request).
This patch adds a second boolean configuration option,
alternator_warn_authorization. When alternator_enforce_authorization
is false but alternator_warn_authorization is true, authentication and
authorization checks are performed as in enforce mode, but failures
are ignored and counted in two new metrics:
scylla_alternator_authentication_failures
scylla_alternator_authorization_failures
additionally,also each authentication or authorization error is logged as
a WARN-level log message. Some users prefer those log messages over
metrics, as the log messages contain additional information about the
failure that can be useful - such as the address of the misconfigured
client, or the username attempted in the request.
All combinations of the two configuration options are allowed:
* If just "enforce" is true, auth failures cause a request failure.
The failures are counted, but not logged.
* If both "enforce" and "warn" are true, auth failures cause a request
failure. The failures are both counted and logged.
* If just "warn" is true, auth failures are ignored (the request
is allowed to compelete) but are counted and logged.
* If neither "enforce" nor "warn" are true, no authentication or
authorization check are done at all. So we don't know about failures,
so naturally we don't count them and don't log them.
This patch is fairly straightforward, doing mainly the following
things:
1. Add an alternator_warn_authorization config parameter.
2. Make sure alternator_enforce_authorization is live-updatable (we'll
use this in a test in the next patch). It "almost" was, but a typo
prevented the live update from working properly.
3. Add the two new metrics, and increment them in every type of
authentication or authorization error.
Some code that needs to increment these new metrics didn't have
access to the "stats" object, so we had to pass it around more.
4. Add log messages when alternator_warn_authorization is true.
5. If alternator_enforce_authorization is false, allow the auth check
to allow the request to proceed (after having counted and/or logged
the auth error).
A separate patch will follow and add documentation suggesting to users
how to use the new "warn" options to safely switch between non-enforcing
to enforcing mode. Another patch will add tests for the new configuration
options, new metrics and new log messages.
Fixes#25308.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node`
from dtests to the repository of Scylla. We divide the test into two
steps to make testing easier and even possible with RF-rack-valid
keyspaces being enforced.
Closesscylladb/scylladb#26285
To align with `vector_indexed_table_select_statement`, this commit renames
`indexed_table_select_statement` to `view_indexed_table_select_statement`
to clarify its usage with materialized views.
The execution of SELECT statements with ANN ordering (vector search) was
previously implemented within `indexed_table_select_statement`. This was
not ideal, as vector search logic is independent of secondary index selects.
This resulted in unnecessary complexity because vector search queries don't
use features like aggregates or paging. More importantly,
`indexed_table_select_statement` assumed a non-null `view_schema` pointer,
which doesn't hold for vector indexes (where `view_ptr` is null).
This caused null pointer dereferences during ANN ordered selects, leading
to crashes (VECTOR-179). Other parts of the class still dereference
`view_schema` without null checks.
Moving the vector search select logic out of
`indexed_table_select_statement` simplifies the code and prevents these
null pointer dereferences.
Previously, streaming readers only verified the checksum of compressed SSTables.
This patch extends checks to also include the digest and the uncompressed checksum (CRC).
These additional checks require reading the digest and CRC components from disk,
which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data,
while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.
If the reader range doesn't cover the full SSTable, the digest check is skipped.
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
The method connects a socket, grabs in/out streams from it then writes
HTTP request and reads+parses the response. For that it uses class
variables for socket and streams, but there's no real need for that --
all three actually exists throughput the method "lifetime".
To fix it, coroutinizes the method. The same could be achieved my moving
the connected socket and streams into do_with() context, but coroutine
is better than that.
(indentation is left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This changes makes local output_stream variable be constructed in the
declaration statement with the help of ternary operator thus avoiding
both -- default-initialization and move-assignment depending on the
standalone condition checking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test consists of several snippets, each creating an input_stream for
some short operation and checking the result. Each snipped over-writes
the local `input_stream in` variable with the new one.
This change wraps each of those snippets into own code block in order to
have own new `input_stream in` variable in each.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
– Workload: N workers perform CAS updates
UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
equals the number of confirmed CAS by worker i (no lost or phantom updates)
despite ongoing tablet moves.
Closesscylladb/scylladb#26113
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test to one test that validates the
virtual tables against the internal cdc tables, and triggering some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.
The directory_lister uses utils::lister under the hood which accepts a
callback to put directory_entry-s in. The directory_lister's callback
then puts the entries into a queue and its .get() method pops up entries
from there to return to caller.
This patch simplifies this code by switching the directory_lister to use
experimental generator lister from seastar. With it, the entries to be
returned from .get() are simply co_await-ed from calling the generator
object (wich co_yield-s them).
As a result the directory_lister becomes smaller and drops the need for
utils::lister. Since directory_lister was created as a replacement for
that callback-based lister, the latter can be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26586
Recently (#26231) there was added a test to check that several API
endpoints, that return tokens and corresponding replica nodes, are
consistent with tablet map. This patch adds one more API endpoint to the
validation -- the /storage_service/tokens_endpoint one.
The extention is pretty straightforward, but the new endpoint returns
back a single (primary) replica for a token, so the test check is
slightly modified to account for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26580
We've enabled the configuration option `rf_rack_valid_keyspaces`,
so we can finally re-enable the events creating and dropping secondary
indexes.
Fixesscylladb/scylladb#26422
We adjust the test to work with the configuration option
`rf_rack_valid_keyspaces` enabled. For that, we ensure that there is
always at least one node in each of the three racks. This way, all
keyspaces we create and manipulate will remain RF-rack-valid since they
all use RF=3.
------------------------------------------------------------------------
To achieve that, we only need to adjust the following events:
1. `init_tablet_transfer`
The event creates a new keyspace and table and manually migrates
a tablet belonging to it. As long as we make sure the migration occurs
within the same rack, there will be no problem.
Since RF == #racks, each rack will have exactly one tablet replica,
so we can migrate the tablet to an arbitrary node in the same rack.
Note that there must exist a node that's not a replica. If there weren't
such a node, the test wouldn't have worked before this commit because
it's not possible to migrate a tablet from one node being its replica to
another. In other words, we have a guarantee that there are at least 4 nodes
in the cluster when we try to migrate a tablet replica.
That said, we check it anyway. If there's no viable node to migrate the
tablet replica to, we log that information and do nothing. That should be
an acceptable solution.
2. `add_new_node`
As long as we add a node to an existing rack, there's no way to
violate the invariant imposed by the configuration option, so we pick
a random rack out of the existing three and create a node in it.
3. `decommission_node`
We need to ensure that the node we'll be trying to decommission is
not the only one in its rack.
Following pretty much the same reasoning as in `init_tablet_transfer`,
we conclude there must be a rack with at least two nodes in it. Otherwise
we'd end up having to migrate a tablet from one replica node to another,
which is not possible.
What's more, decommissioning a node is not possible if any node in
the cluster is dead, so we can assume that `manager.running_servers`
returns the whole cluster.
4. `remove_node`
The same as `decommission_node`. Just note although the node we choose to
remove must be first stopped, none other node can be dead, so the whole
cluster must be returned by `manager.running_servers`.
------------------------------------------------------------------------
There's one more important thing to note. The test may sometimes trigger
a sequence of events where a new node is started, but, due to an error
injection, its initialization is not completed. Among other things, the
node may NOT have a host ID recognized by the rest of the nodes in the
cluster, and operations like tablet migration will fail if they target
it.
Thankfully, there seems to be a way to avoid problems stemming from
that. When a new node is added to the cluster, it should appear at the
end of the list returned by `manager.running_servers`. This most likely
stems from how dictionaries work in Python:
"Keys and values are iterated over in insertion order."
-- https://docs.python.org/3/library/stdtypes.html#dict-views
and the fact that we keep track of running servers using a dictionary.
Furthermore, we rely on the assumption that the test currently works
correctly.
Assume, to the contrary, that among the nodes taking part in the operations
listed above, there is at most one node per rack that has its host ID recognized
by the rest of the cluster. Note that only those nodes can store any tablets.
Let's refer to the set of those nodes as X.
Assume that we're dealing with tablet migration, decommissioning, or removing
a node. Since those operations involve tablet migration, at least one tablet
will need to be migrated from the node in question to another node in X.
However, since X consists of at most three nodes, and one of them is losing
its tablet, there is no viable target for the tablet, so the operation fails.
Using those assumptions, an auxiliary function, `select_viable_rack`,
was designed to carefully choose a correct rack, which we'll then pick nodes
from to perform the topological operations. It's simple: we just find the first
rack in the list that has at least two nodes in it. That should ensure that we
perform an operation that doesn't lead to any unforeseen disaster.
------------------------------------------------------------------------
Since the test effectively becomes more complex due to more care for keeping
the topology of the cluster valid, we extend the log messages to make them
more helpful when debugging a failure.
This patch adds reproducing tests in test/alternator for issue #23438,
which is about missing checks for the length of headers and the URL
in Alternator requests. These should be limited, because Seastar's
HTTP server, which Scylla uses, reads them into memory so they can OOM
Scylla.
The tests demonstrate that DynamoDB enforces a 16 KB limit on the
headers and the URL of the request, but Scylla doesn't (a code
inspection suggests it does not in fact have any limit).
The two tests pass on DynamoDB and currently xfail on Alternator.
Refs #23438.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23442
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.
That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26595
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog reply fails;
- repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
Closesscylladb/scylladb#26319
* github.com:scylladb/scylladb:
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. We fix the
bug in this PR.
Implementation strategy:
1. Move code responsible for producing the schema
of a secondary index to the file that handles
`CREATE INDEX`.
2. Set the property when creating the view.
3. Add reproducer tests.
Fixesscylladb/scylladb#26542
Backport: we can discuss it.
Closesscylladb/scylladb#26543
* github.com:scylladb/scylladb:
index: Set tombstone_gc when creating secondary index
index: Make `create_view_for_index` method of `create_index_statement`
index: Move code for creating MV of secondary index to cql3
db, cql3: Move creation of underlying MV for index
Following DynamoDB, Alternator also places a 16 MB limit on the size of
a request. Such a limit is necessary to avoid running out of memory -
because the AWS message authentication protocol requires reading the
entire request into memory before its signature can be verified.
Our implementation for this limit used Seastar's HTTP server's
content_length_limit feature. However, this Seastar feature is
incomplete - it only works when the request uses the Content-Length
header, and doesn't do anything if the request doesn't have a
Content-Length (it may use chunked encoding, or have no length at all).
So malicious users can cause Scylla to OOM by sending a huge request
without a Content-Length.
So in this patch we stop using the incomplete Seastar feature, and
implement the length limit in Scylla in a way that works correctly with
or without Content-Length: We read from the input stream and if we go
over 16MB, we generate an error.
Because we dropped Seastar's protection against a long Content-Length,
we also need to fix a piece of code which used Content-Length to reserve
some semaphore units to prevent reading many large requests in parallel.
We fix two problems in the code:
1. If Content-Length is over the limit, we shouldn't attempt to reserve
semaphore units - this should just be a Payload Too Large error.
2. If Content-Length is missing, the existing code did nothing and had
a TODO that we should. In this patch we implement what was suggested
in that TODO: We temporarily reserve the whole 16 MB limit, and
after reading the actual request, we return part of the reservation
according to the real request size.
That last fix is important, because typically the largest requests will be
BatchWriteItem where a well-written client would want to use chunked
encoding, not Content-Length, to avoid materializing the entire request
up-front. For such clients, the memory use semaphore did nothing, and
now it does the right thing.
Note that this patch does *not* solve the problem #12166 that existed
with Seastar's length-limiting implementation but still exists in the
new in-Scylla length-limiting implementation: The fact we send an
error response in the middle of the request and then close the
connection, while the client continues to send the request, can lead
to an RST being sent by the server kernel. Usually this will be fine -
well-written client libraries will be able to read the response before
the RST. But even with a well-written library in some rare timings
the client may get the RST before the response, and will miss the
response, and get an empty or partial response or "connection reset
by peer". This issue existed before this patch, and still exists, but
is probably of minor impact.
Fixes#8196
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23434
The utils library requires OpenSSL's libcrypto for cryptographic
operations and without linking libcrypto, builds fail with undefined
symbol errors. Fix that by linking `crypto` to `utils` library when
compiled with cmake. The build files generated with configure.py already
have `crypto` lib linked, so they do not have this issue.
Fix#26705
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#26707
This change adds the ability to move tablets sizes in load_stats after a
tablet migration or table resize (split/merge). This is needed because
the size based load balancer needs to have tablet size data which is as
accurate as possible, in order to issue migrations which improve
load balance.
In #24031 users complained, that trace message is truncated, namely it's
no longer json parsable and table name might not be part of the output.
This path enables users to configure maximum size of trace message.
In case user wanted `table` name, but didn't care about message size,
#26634 will help.
- add configuration varable `alternator_max_users_query_size_in_trace_output`
with default value of 4096 (4 times old default value).
- modify `truncated_content_view` function to use new configuration
variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
size, previously trunctation would also happen when data arrived in
more than one chunk
- update `truncated_content_view` to better handle truncated value
(limit number of copies)
- fix `scylla_config_read` call - call to `query` for a configuration
name that is not existing will return `Items` array empty
(but present) - this would raise array access exception few lines
below.
- add test
Refs #26634
Refs #24031Closesscylladb/scylladb#26618
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files.
test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed.
Fixes#26606
Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444.
Closesscylladb/scylladb#26622
* github.com:scylladb/scylladb:
test_tablets2: verify SSTable cleanup after tablet load and stream
tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.
Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.
Fixes#19959Closesscylladb/scylladb#26638
With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB.
Despite this update is purely a refactoring effort and does not introduce functional changes it should be ported back to 2025.3 and 2025.4 otherwise it will make future backports of bugfixes/improvements related to `s3_client` near to impossible
ref: https://github.com/scylladb/seastar/issues/2803
depends on: https://github.com/scylladb/seastar/pull/2960Closesscylladb/scylladb#25801
* github.com:scylladb/scylladb:
s3_client: remove unnecessary `co_await` in `make_request`
s3 cleanup: remove obsolete retry-related classes
s3_client: remove unused `filler_exception`
s3_client: fix indentation
s3_client: simplify chunked download error handling using `make_request`
s3_client: reformat `make_request` functions for readability
s3_client: eliminate duplication in `make_request` by using overload
s3_client: reformat `make_request` function declarations for readability
s3_client: reorder `make_request` and helper declarations
s3_client: add `make_request` override with custom retry and error handler
s3_client: migrate s3_client to Seastar HTTP client
s3_client: fix crash in `copy_s3_object` due to dangling stream
s3_client: coroutinize `copy_s3_object` response callback
aws_error: handle missing `unexpected_status_error` case
s3_creds: use Seastar HTTP client with retry strategy
retry_strategy: add exponential backoff to `default_aws_retry_strategy`
retry_strategy: introduce Seastar-based retry strategy
retry_strategy: update CMake and configure.py for new strategy
retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy`
retry_strategy: fix include
retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh
retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc
When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard.
The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395).
Fixesscylladb/scylladb#26727
backport: need to backport to 2025.4
Closesscylladb/scylladb#26719
* https://github.com/scylladb/scylladb:
paxos_state: use shards_ready_for_reads
paxos_state: inline shards_for_writes into get_replica_lock
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.
Fixes https://github.com/scylladb/scylladb/issues/25341Closesscylladb/scylladb#25456
* github.com:scylladb/scylladb:
mv: limit concurrent view updates from all sources
database: rename _view_update_concurrency_sem to _view_update_memory_sem
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.
Use get_primary_replica for get_primary_endpoints.
Fixes: https://github.com/scylladb/scylladb/issues/21883.
Closesscylladb/scylladb#26385
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixesscylladb/scylladb#26732
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.
The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.
Fixes https://github.com/scylladb/scylladb/issues/25341
The std::optional<T> inject_parameter(...) method is a template, and in
dev/debug modes this parameter is defaulted to std::string_view, but for
release mode it's not. This patch makes it symmetrical.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26706
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets.
The previous implementation unlinked SSTables immediately after streaming them for the first tablet,
potentially making them partially unavailable for subsequent tablets.
This patches replaces unlink() call with mark_for_deletion()
Acquiring locks on both shards for the entire tablet migration period
is redundant. In most cases, locking only the old shard or only the new
shard is sufficient. Using shards_ready_for_reads reduces the
situations in which we need to lock both shards to:
* intra-node migrations only
* only during the write_both_read_new state
Once the global barrier completes in the write_both_read_new state, no
requests remain on the old shard, so we can safely acquire locks
only on the new shard.
Fixesscylladb/scylladb#26727
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
the garbage collection works by finding the newest cdc timestamp that has been
closed for more than the configured cdc TTL, and removing all information from
the cdc internal tables about cdc timestamps and streams up to this timestamp.
in general it should be safe to remove information about these streams because
they are closed for more than TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
the exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer ttl.
Fixes https://github.com/scylladb/scylladb/issues/26669Closesscylladb/scylladb#26410
* github.com:scylladb/scylladb:
cdc: garbage collect CDC streams periodically
cdc: helpers for garbage collecting old streams for tablets
Currently when a null vector is passed to an ANN query we fail with a
quite confusing error ("NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>:
<Error from server: code=0000 [Server error] message="to_bytes() called
on raw value that is null">})").
This patch fixes that by throwing an InvalidRequestException with an
appropriate message instead.
We also add a test case that validates this behavior.
Fixes: VECTOR-257
Closesscylladb/scylladb#26510
Fixes#26641
* Adds shared abstraction for dockerized mock services for out pytests (not using python docker, due to both library and podman)
* Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing
* Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest.
* Shared KMIP mock between boost test and pytest and speeds up boost test shutdown.
When merged, the dtest counterpart can be decommissioned.
Closesscylladb/scylladb#26642
* github.com:scylladb/scylladb:
test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
test::cluster::test_encryption: Port dtest EAR tests
test::cluster::conftest: Add key_provider fixture
test::pylib::encryption_provider: Port dtest encryption provider classes
test::pylib::dockerized_service: Add helper for running docker/podman
test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures
test::boost::kmip_wrapper: Move python script for PyKMIP to pylib
Problems addressed by this PR
* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
* The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
* Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
* It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
* It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.
Fixesscylladb/scylladb#26150
backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting
Closesscylladb/scylladb#26315
* https://github.com/scylladb/scylladb:
test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
test_fencing: fix due to new version increment
test_automatic_cleanup: clean it up
storage_proxy: wait for closing sessions in sstable cleanup fiber
storage_proxy: rename await_pending_writes -> await_stale_pending_writes
storage_proxy: use run_fenceable_write
storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
storage_proxy: introduce run_fenceable_write
storage_proxy: move update_fence_version from shared_token_metadata
storage_proxy: fix start_write() operation scope in apply_locally
storage_proxy: move post fence check into handle_write
storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
storage_proxy::handle_read: add fence check before get_schema
storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
sstable_cleanup_fiber: use coroutine::parallel_for_each
storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
topology_coordinator: barrier before cleanup
topology_coordinator: small start_cleanup refactoring
global_token_metadata_barrier: add fenced flag
No need to have two functions since both callers of get_replica_lock()
use shards_for_writes() to compute the shards where the locks
must be acquired.
Also while at it, inline the acquire() lambda in get_replica_lock()
and replace it with a loop over shards. This makes the code
more strightforward.
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
builds mutations that update the internal cdc tables and remove the
older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
above to find a new base and build mutations to update it for a
specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where certain system table's content need to be manually re-created, system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because of all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.
The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.
Fixes: https://github.com/scylladb/scylladb/issues/26506
New feature, no backport.
Closesscylladb/scylladb#26515
* github.com:scylladb/scylladb:
test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
replica/mutation_dump: add support for virtual tables
tools/scylla-sstable: print_query_results_json(): handle empty value buffer
tools/scylla-sstable: add cql support to write operation
tools/scylla-sstable: write_operation(): fix indentation
tools/scylla-sstable: write_operation(): prepare for a new input-format
tools/scylla-sstable: generalize query_operation_validate_query()
tools/scylla-sstable: move query_operation_validate_query()
tools/scylla-sstable: extract schema transformation from query operation
replica/table: add virtual write hook to the other apply() overload too
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.
Fixesscylladb/scylladb#23707.
Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.
Closesscylladb/scylladb#26227
* github.com:scylladb/scylladb:
replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
replica: add tombstone_gc_enabled parameter to mutation query methods
mutation/mutation_compactor: remove _can_gc member
tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
This patch changes the tablet size map in load_stats. Previously, this
data structure was:
std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;
and is changed into:
std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;
This allows for improved performance of tablet tablet size reconciliation.
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.
Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.
Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.
This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.
Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581Closesscylladb/scylladb#26589
Refactor `chunked_download_source` to eliminate redundant exception
handling by leveraging the new `make_request` override with custom
retry strategy. This streamlines the download fiber logic, improving
readability and maintainability.
Reformats the `make_request` function declarations to improve readability
due to the large number of arguments. This aligns with our formatting
guidelines and makes the code easier to maintain.
handler
Introduce an override for `make_request` in `s3_client` to support
custom retry strategies and error handlers, enabling flexibility
beyond the default client behavior and improving control over request
handling
In the `copy_part` method, move the `input_stream<char>` argument
into a local variable before use. Failing to do so can lead to a
SIGSEGV or trigger an abort under address sanitizer.
Add a missing `case` clause to the `switch` statement to correctly
handle scenarios where `unexpected_status_error` is thrown. This
fixes overlooked error handling and improves robustness.
In AWS credentials providers, replace `retryable_http_client` with
Seastar's native HTTP client. Integrate the newly added
`default_aws_retry_strategy` to handle retries more efficiently and
reduce dependency on external retry logic.
Add a new class derived from Seastar's `default_retry_strategy`.
Relocate the `should_retry` implementation from Scylla's
`default_retry_strategy` into the new class to centralize and
standardize retry behavior.
Renames the `default_retry_strategy` class to `default_aws_retry_strategy`
to clarify its association with the S3 client implementation. This avoids
confusion with the unrelated `seastar::default_retry_strategy` class.
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
If any batch replay failed, we cannot update repair_time as we risk the
data resurrection.
If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
Currently, we skip batch replay if less than batch_log_timeout passed
from the moment the batch was written. batch_log_timeout value can
be configured. If it is large, it won't be replayed for a long time.
If the tombstone will be GC'd before the batch is replayed, then we
risk the data resurrection.
To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
now() < written_at + min(timeout + propagation_delay)
repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
repair_time <= now()
So we know that:
repair_time < written_at + propagation_delay
With this condition we are sure that GC won't happen.
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.
Such batches will probably be skipped everytime, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.
This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and set of tests to ensure correct behavior and error handling.
Fixes#19060
Backport is not required, it is a new feature
Closesscylladb/scylladb#26579
* github.com:scylladb/scylladb:
sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
test/nodetool: add scrub drop-unfixable-sstables option testcase
scrub: add support for dropping unfixable sstables in segregate mode
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations.
The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which exclude with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper).
Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437)
backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.
Closesscylladb/scylladb#26619
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_tablets_merge_waits_for_lwt
test.py: add universalasync_typed_wrap
tablet_metadata_guard: fix split/merge handling
tablet_metadata_guard: add debug logs
paxos_state: shards_for_writes: improve the error message
storage_service: barrier_and_drain – change log level to info
topology_coordinator: fix log message
Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test,
which verifies that when the drop-unfixable-sstables flag is enabled in segregate
mode, corrupted SSTables are correctly dropped.
This patches introduces the test_scrub_drop_unfixable_sstables_option testcase,
which verifies that correct request is generated when the --drop-unfixable-sstables flag is used.
It also validates that an error is thrown if the drop-unfixable-sstables
flag is enabled and mode is not set to SEGREGATE.
This patch introduces test_scrub_drop_unfixable_sstables_option, which test
This patch adds a new flag `drop-unfixable-sstables` to the scrub operation
in segregate mode, allowing to automatically drop SSTables that
cannot be fixed during scrub. It also includes API support of the 'drop_unfixable_sstables'
paramater and validation to ensure this flag is not enabled in other modes rather than segragate.
Topology version is now bumped when a node finishes bootstrapping.
As a result, fence_version == version - 1, and decrementing version
in the test no longer triggers a stale topology exception.
Fix: run cleanup_all to invoke the global barrier, which synchronizes
fence_version := version on all nodes.
Remove redundant imports and variables. Extract cleanup_all
function. Add logs. Remove pytest.mark.prepare_3_racks_cluster --
the test doesn't actually need a 3 node cluster, one initial
node is enough.
All mutation_holder::apply_locall() implementations now do the same
post fence chech. In this commit we hoist this check up to
abstract_write_response_handler::apply_locally().
This function is intended to replace start_write() in subsequent
commits. It provides the following benefits:
* Remove duplication: All start_write() call sites must run the fence
check after the operation completes. run_fenceable_write() encapsulates
this pattern.
* Fix a race: To ensure no new stale write operations occur during
cleanup, a fence check before start_write() was previously used.
However, yields in several code paths between the check and
start_write() made it non-atomic, allowing a stale operation to slip in
if the fence_version was updated in between.
* Optimize waiting: We do not need to wait for all operations—only for
vnode-based, non-local tables with versions smaller than the current
fence_version.
Future commits will extend update_fence_version, and it is simpler to do
so if the function resides in storage_proxy. Additionally, fence_version
is the only field this function accesses, and it is used solely within
storage_proxy, making this change natural on its own.
The operation must be held during the local write. Before this commit,
its scope ended after returning from apply_locally(), so it
did not actually provide any protection.
handle_write() is invoked from receive_mutation_handler() and
handle_paxos_learn(), and both previously performed a fence check in
apply_fn. This commit hoists the fence check into handle_write() to
reduce code duplication.
Additionally, move start_write() after get_schema_for_write(), since
there is no need to hold the operation while querying the schema.
As noted in the code comments, start_write() does not need to be held
during counter replication; it is required only while performing local
storage modifications. Move the start_write() call and the fence
check down to mutate_counter_on_leader_and_replicate().
Additionally, mutate_counters_on_leader() is updated to check for
possible stale_topology_exception() and properly package them
in the resulting exception_variant structure.
The flush_all_tables() call ensures that no obsolete, cleanup-eligible
writes remain in the commitlog. This does not need to run under the
group0 lock, so move it outside.
Also, run await_pending_writes() before flush_all_tables(), since
pending writes may include data that must be cleaned up.
Finally, add more detailed info-level logs to trace the stages of the
cleanup procedure.
Cleanup needs a barrier to make sure that no request coordinators
are sending requests to old replicas/ranges that we're going to cleanup.
For example, during node bootstrap, the cleanup
process on replicas must be protected against coordinators running
write_both_read_new and sending requests to old ranges.
We run a barrier to ensure that most data-plane requests with the old
topology finish before cleanup starts. At the same time, we do not want
to block cleanup if the barrier fails on some replicas. Once the fence
is committed to group0, we can safely proceed, since any late request
with the old topology will be fenced out on the replica.
The test for this case is added in a separate commit
"test_automatic_cleanup: add test_cleanup_waits_for_stale_writes"
Rename start_cleanup -> start_vnodes_cleanup for clarity.
Pass topology_request and server_id in start_vnodes_cleanup, we will
need them for better logging later.
Cleanup needs a barrier. For example, during node bootstrap, the cleanup
process on replicas must be protected against coordinators running
write_both_read_new and sending requests to old ranges.
We run a barrier to ensure that most data-plane requests with the old
topology finish before cleanup starts. At the same time, we do not want
to block cleanup if the barrier fails on some replicas. Once the fence is
committed to group0, we can safely proceed, since any late request with
the old topology will be fenced out on the replica.
To support this, introduce a "fenced" flag. The client can pass a pointer
to a bool, which will be set to true after the new fenced_version is
committed.
The patch e34deb72f9 (repair: Rename incremental mode name)
missed one place that references the removed regular mode name.
Fixes#26503Closesscylladb/scylladb#26660
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.
We fix this issue in this PR and add a regression test.
This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.
Fixes#26534
This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).
Closesscylladb/scylladb#26612
* github.com:scylladb/scylladb:
test: test group0 tombstone GC in the Raft-based recovery procedure
group0_state_id_handler: remove unused group0_server_accessor
group0_state_id_handler: consider state IDs of all non-ignored topology members
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.
In this commit we fix the problem by adding a typed
wrapped around universalasync.wrap.
Fixes: scylladb/scylladb#26639
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.
Fixesscylladb/scylladb#26437
Debugging global barrier issues is difficult without these logs.
Since barriers do not occur frequently, increasing the log level should not produce excessive output.
Among other things, the merge includes the patch "http: add "Connection:
close" header to final server response.". This Fixes#26298: A missing
response header meant that a test's client code sometimes didn't notice
that the server closed the connection (since the client didn't need to
use the connection again), which made one test flaky.
* seastar bd74b3fa...63900e03 (6):
> Merge 'Rework output_stream::slow_write()' from Pavel Emelyanov
output_stream: Fix indentation of the slow_write() method
output_stream: Remove pointless else
output_stream: Replace std::swap with std::exchange
output_stream: Unify some code-paths of slow_write()
> Merge 'Deprecate in/out streams move-assignment operator' from Pavel Emelyanov
iostream: Deprecate input/output stream default constructor and move-assignment operator
test: Sub-split test-cases
test: Don't reuse output_stream in file demo
test: Keep input_/output_stream as optional
util: Construct file_data_source in with_file_input_stream()
websocket: Construct in/out in initializer list
rpc: Wrap socket and buffers
> scripts/perftune.py: detect corrupted NUMA topology information
> Merge 'memory, smp: support more than 256 shards' from Avi Kivity
reactor, smp: allocate smp queues across all shards
memory: increase maximum shard count
memory: make cpu_id_shift and related mask dynamic
resource, memory: move memory limit calculation to memory.cc
resource: don't error if --overprovisioned and asking for more vcpus than available
> Merge 'Update perf_test text output, make columns selectable' from Travis Downs
perf_tests: enhance text output
perf_test_tests: add some check_output tests
`CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.
In tablet-based keyspaces we can just don't create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.
Fixesscylladb/scylladb#26615
This fix should be backported to 2025.4.
Closesscylladb/scylladb#26657
* github.com:scylladb/scylladb:
test/alternator/test_tablets: add test for GSI backfill with tablets
test/alternator/test_tablets: add reproducer for GSI with tablets
alternator/executor: instantly mark view as built when creating it with base table
While there is a docker interface for python, need to deal with
the docker-in-docker issues etc. This uses pure subprocess and
stream parse. Meant to provide enough flexibility for all our
docker mock server needs.
Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone.
Using newer seastar API, no need to backport
Closesscylladb/scylladb#26592
* github.com:scylladb/scylladb:
code: Fix indentation after previous patch
code: Switch to seastar API level 9
transport: Open-code invoke_with_counting into counting_data_sink::put
transport: Don't use scattered_message
utils: Implement memory_data_sink::put(net::packet)
The test should pass without the fix for scylladb/scylladb#26615,
because the `executor::updata_table()` uses
`service::prepare_new_view_announcement()`, which creates view building
tasks for the view.
But it's better to add this test.
Rewrite wait_for first_completed to return only first completed task guarantee
of awaiting(disappearing) all cancelled and finished tasks
Use wait_for_first_completed to avoid false pass tests in the future and issues
like #26148
Use gather_safely to await tasks and removing warning that coroutine was
not awaited
Closesscylladb/scylladb#26435
There are some environment which has corrupted NUMA topology
information, such as some instance types on AWS EC2 with specific Linux
kernel images.
On such environment, we cannot get HW information correctly from hwloc,
so we cannot proceed optimization on perftune.
To avoid causing script error, check NUMA topology information and skip
running perftune if the information corrupted.
Related scylladb/seastar#2925Closesscylladb/scylladb#26344
`CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.
In tablet-based keyspaces we can just don't create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.
Fixesscylladb/scylladb#26615
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.
A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.
Fixes scylladb/scylladb#26355
backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets
Closesscylladb/scylladb#26408
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_shutdown
storage_proxy: wait for write handler destruction
storage_proxy: coroutinize cancel_write_handlers
storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
This patch removes the dependence of vector search module
on the cql3 module by moving the contents of cql3/type_json.hh
to types/json_utils.hh and removing the usage of cql3 primary_key
object in vector_store_client. We also make the needed adjustments
to files that were previously using the afformentioned type_json.hh
file.
This fixes the circular dependency cql3 <-> vector_search.
Closesscylladb/scylladb#26482
Add a simple test verifying our changes for the compatible CDC schema.
The test checks we can write to a table with CDC enabled after ALTER and
after node restart.
in the CDC log transformer, when augmenting a base mutation, use the CDC
log schema that is compatible with the base schema, if set.
Now that the base schema has a pointer to its CDC schema, we can use it
instead of getting the current schema from the db, which may not be
compatible with the base schema.
The compatible CDC schema may not be set if the cluster is not using
raft mode for schema. In this case, we maintain the previous behavior.
When creating a schema for a non-CDC table in the schema_applier, find
its CDC schema that we created previously in the same operation, if any,
and create the schema with a pointer to the CDC schema.
We use the fact that for a base table with CDC enabled, its CDC schema
is created or altered together in the same group0 operation.
Similarly, in schema_tables, when creating table schemas from the
schema tables, first create all schemas that don't have CDC enabled,
then create schemas that have CDC enabled by extending them with the
pointer to the CDC schema that we created before.
There are few additional cases where we create schemas that we need to
consider how to handle.
When loading a schema from schema tables in the schema_loader we decide
not to set the CDC schema, because this schema is mostly used for tools
and it's not used for generating CDC mutations.
When transporting a schema by RPC in the migration manager, we don't
transport its CDC schema, and we always set it to null. Because we use
raft we expect this shouldn't have any effect, because the schema is
synchronized through raft and not through the RPC.
Previously in the schema applier we have two maps of schema_mutations,
for tables and for views. Now create another map for CDC tables by
extracting them from the non-views tables map.
We maintain the previous behavior by applying each operation that's done
on the tables map, to the CDC map as well.
Later we will want to handle CDC and non-CDC tables differently. We want
to be able to create all CDC schemas first, so when we create the
non-CDC tables we can create them with a pointer to their CDC schemas.
Add to the schema object a member that points to the CDC schema object
that is compatible with this schema, if any.
The compatible CDC schema is created and altered with its base schema in
the same group0 operation.
When generating CDC log mutations for some base mutation we want them to
be created using a compatible schema thas has a CDC column corresponding
to each base column. This change will allow us to find the right CDC
schema given a base mutation.
We also update the relevant structures in the schema registry that are
related to learning about schemas and transporting schemas across
shards or nodes.
When transporting a schema as frozen_schema, we need to transport the
frozen cdc schema as well, and set it again when unfreezing and
reconstructing the schema.
When adding a schema to the registry, we need to ensure its CDC schema
is added to the registry as well.
Currently we always set the CDC schema to nullptr and maintain the
previous behavior. We will change it in a later commit. Until then, we
mark all places where CDC schema is passed clearly so we don't forget
it.
remove the _base_info member from global_schema_ptr, and used the
base_info we have stored in the schema registry entry instead.
Currently when constructing a global_schema_ptr from a schema_ptr it
extracts and stores the base_info from the schema_ptr. Later it uses it
to reconstruct the schema_ptr, together with the frozen schema from the
schema registry entry.
But we can use the base_info that is already stored in the
schema registry entry.
Change the schema loader type in the schema_registry to return a
extended_frozen_schema instead of view_schema_and_base_info, and
remove view_schema_and_base_info which is not used anymore.
The casting between them is trivial.
The schema_registry_entry holds a frozen_schema and a base_info. The
base_info is extracted from the schema_ptr on load of a schema_ptr, and
it is used when unfreezing the schema.
But this is exactly what extended_frozen_schema is doing, so we can
just store an object of this type in the schema_registry_entry.
This makes the code simpler because the schema registry doesn't need to
be aware of the base_info.
Currently we construct a frozen schema with base info in few places, and
the caller is responsible for constructing the frozen schema and extracting
the base info if it's a view table.
We change it to make it simpler and remove the burden from the caller.
The caller can simply pass the schema_ptr, and the constructor for
extended_frozen_schema will construct the frozen schema and extract
the additional info it needs. This will make it easier to add additional
fields, and reduces code duplication.
We also make temporary castings between extended_frozen_schema and
view_schema_and_base_info for the transition, which are trivial, until
they are combined to a single type.
This commit starts a series of refactoring commits of the frozen_schema
to reduce duplication and make it easier to extend.
Currently there are two essentially identical types,
frozen_schema_with_base_info and view_schema_and_base_info in the
schema_registry that hold a frozen_schema together with a base_info for
view schemas.
Their role is to pass around a frozen schema together with additional
info that is extracted from the schema and passed around with it when
transporting it across shards or nodes, and is needed for
reconstructing it, and it is not part of the schema mutations.
Our goal is to combine them to a single type that we will call
extended_frozen_schema.
The recursive call to alter_system_schema() was missing the await
keyword, which meant the coroutine was never actually executed and
the test wasn't doing what it was supposed to do.
Not backporting: Test fix only.
Closesscylladb/scylladb#26623
Add `serve` impl that does not mess with signals, and shutdown
that does not mess with threads. Also speed up standalone shutdown
to make boost tests less slow.
In clang 21, the -fextend-variable-liveness option was made
default [1] with -Og. It helps reduce "optimized out" problems while
debugging.
However, it conflicts [2] with coroutines.
To prevent problems during the upgrade to Clang 21, disable the option.
[1] 36af7345df
[2] https://github.com/llvm/llvm-project/issues/163007Closesscylladb/scylladb#26573
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case whe the retry strategy will not help since the request itself have to be updated. For example, authentication token expiration or timestamp on the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether
Fixes: https://github.com/scylladb/scylladb/issues/26483
Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions
Closesscylladb/scylladb#26527
* github.com:scylladb/scylladb:
s3_client: tune logging level
s3_client: add logging
s3_client: improve exception handling for chunked downloads
s3_client: fix indentation
s3_client: add max for client level retries
s3_client: remove `s3_retry_strategy`
s3_client: support high-level request retries
s3_client: just reformat `make_request`
s3_client: unify `make_request` implementation
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
Closesscylladb/scylladb#26456
* github.com:scylladb/scylladb:
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
In 380f243986 we added support for rack
lists in replication options. Drivers which are not prepared to parse
that (as of now, all of them), will not create metadata object for
that keyspace. This breaks, for example, the "copy to/from" cqlsh
command. Potentially other things too.
To fix that, keep the "replication" column in the old format, and
store numeric RF there, which corresponds to the number of
replicas. Accurate options in the new format are put in
"replication_v2".
We set replication_v2 in the schema only when it differs from the old
"replication" so that the new column is not set during upgrade,
otherwise downgrade would fail. Partition tombstone is added to ensure
that pre-alter replication_v2 value is deleted on alters which change
replication to a value which is the same as the post-alter
"replication" value.
Fixes#26415Closesscylladb/scylladb#26429
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case.
Fixes: scylladb/scylladb#26273
Marked for backport to 2025.4 as MVs are getting un-experimentaled there.
Closesscylladb/scylladb#26278
* github.com:scylladb/scylladb:
test: mv: add a test for tablet merge
tablet_allocator, tests: remove allow_tablet_merge_with_views injection
tablet_allocator: allow merges in base tables if rf-rack-valid=true
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes#26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.
Fixes: scylladb/scylladb#26403Closesscylladb/scylladb#26588
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.
A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.
Fixesscylladb/scylladb#26355
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.
The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.
This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.
UserIdentity is a map of two fields in GetRecords responses, which
always has the same value. It may be missing, or contain a constant
object with value `{"type": "Service", "principalId":
"dynamodb.amazonaws.com"}`. Currently, the latter is set only for
`REMOVE`s triggered by TTL.
This commit introduces two new CDC operation types: `service_row_delete`
and `service_partition_delete`, emitted in place of `row_delete` and
`partition_delete`. Alternator Streams treats them as regular `REMOVE`s,
but in addition adds the `userIdentity` field to the record.
This change may break existing Scylla libraries for reading raw CDC
tables, but we doubt that anybody has this use case.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26121
Fixes https://github.com/scylladb/scylladb/issues/11523Closesscylladb/scylladb#26460
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:
- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation
These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege, therefore a new permission, one that allows the SELECTs conditional on the existence of a vector_index on the table.
Fixes: VECTOR-201
Backport reasoning:
Backport to 2025.4 required as this can make upgrading clusters more difficult if we add it in 2026.1. As for now Scylla Cloud requires version 2025.4 to enable vector search and permission is set by orchestrator so there is no chance that someone will try to add this permission during upgrade. In 2026.1 it will be more difficult.
Closesscylladb/scylladb#25976
* github.com:scylladb/scylladb:
docs: adjust docs for VS auth changes
test: add tests for VECTOR_SEARCH_INDEXING permission
cql: allow VECTOR_SEARCH_INDEXING users to select
auth: add possibilty to check for any permission in set
auth: add a new permission VECTOR_SEARCH_INDEXING
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. That
was a bug and we fix it now.
Two reproducer tests is added for validation. They reproduce the problem
and don't pass before this commit.
Fixesscylladb/scylladb#26542
We move the code responsible for creating the schema for the underlying
materialized view of a secondary index from `index/` to `cql3/` so that
it's close to that responsible for performing `CREATE INDEX`. That's in
line with how other CQL statements are designed.
Note that the moved method is still a method of `secondary_index_manager`.
We'll make it a method of `create_index_statement` in the following
commit.
The main goal of this patch is to give more control over the creation
of the underlying view on an index to `create_index_statement.cc`.
That goal is in line with how the other statements are executed:
the schema is built in the cql3 module and only the ready schema_ptr
is passed further. That should also make the code cleaner and easier
to understand.
There are a few important things to note here:
* A call to `service::prepare_new_view_announcement` appears out of nowhere.
Aside from some validation checks and logging, that function does pretty
much the same as the pre-existing code we remove:
a. It creates Raft mutations based on the passed `view_ptr`.
b. It creates Raft mutations responsible for view building tasks.
c. It notifies about a new column family.
* We seemingly get rid of the code that creates view building tasks. That's not
true: we still do that via `service::prepare_new_view_announcement`.
That should explain why the change doesn't remove any relevant logic.
On the other hand, it might be more difficult to explain why moving the
code is correct. I'll touch on it below.
Before that, it may also be important to highlight that this commit only
affects the logic responsible for creating an index. There should be no
effect on any other part of how Scylla behaves.
---
Proving the correctness of the solution would take quite a lot of space,
so I'll only summarize it. It relies on a few things:
1. Two schema changes cannot happen in one operation. We allow for more
but only when those changes are dependent on each other and when
the additional ones are internal for Scylla, e.g. creating an index
leads to creating the underlying materialized view.
2. There are no entities or components that rely on indexes.
3. Each index is uniquely defined by the keyspace it belongs to
and the name of the index.
4. There is a bijection between rows in `system_schema.indexes`
and the currently existing indexes.
5. The name of an unnamed index depends on the name of the base table
and the names of the indexed columns. The name of an unnamed index
may have a number attached to it, but that number only depends on
the state of the schema at the time of creation of the index, and
it never changes later on. There are no other things the name of
an unnamed index depends on.
6. Scylla doesn't allow for changing any column in the base table
that has an index depending on it.
Based on that, we conclude that every existing index has exactly one
entry in `system_schema.indexes`, and the primary key of that entry
never changes.
The columns of `system_schema.indexes` that are not part of the primary
key are: `kind` and `options`. Both values are only decided at the time
of creation of an index, and currently there's no way to modify them.
That implies that there are only two events when an entry in the system
table can change: when creating an index and when dropping an index.
---
When we consider the previous place of the logic that this commit moves
to `cql3/statements/create_index_statement.cc`, it works like this:
1. We compare the sets of indexes defined on a specific table
(in the form of a structure called `index_metadata`) before and
after an operation.
2. We divide the entries into three sets: those present in both sets
and those present in only one of them.
3. We handle each of those three sets separately.
The structure `index_metadata` is a reflection of entries in
`system_schema.indexes`. It stores one more parameter -- `local` --
but its value depends on the other values of an entry, so we can ignore
it in this reasoning.
Because an index cannot be modified -- it can only be created or dropped
-- there are at most two non-empty sets: the set of new indexes and the
set of dropped indexes. Those sets are only non-empty during an operation
like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`,
`DROP KEYSPACE`. Note that it's impossible to drop an index by dropping
the underlying materialized view -- Scylla doesn't allow for that.
However, the code in `migration_manager.cc` we call
(`prepare_column_family_update_announcement`) and the code that we call
in `schema_tables.cc` (`make_update_table_mutations`) is only triggered
by *updates* related to the base table. In the context of `DROP TABLE`
or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement`
instead. In other words, we're only concerned with `CREATE INDEX` and
`DROP INDEX`.
---
A conclusion from this reasoning is that we only need to consider those
two situations when talking about correctness of this change. The impact
of this commit is that we may have potentially reordered mutations in the
resulting vector that will be applied to the Raft log.
The only mutations we may have reordered are the mutations responsible for
creating the underlying view and the mutations responsible for updating
columns in the base table. It's clear then that this commit brings no change
at all: we only give `cql3/statements/create_index_statement.cc` more
control over creating the underlying view.
---
We leave a remnant of the code in `db/schema_tables.cc` responsible
for dropping an index along with its underlying view. It would require
changing a bit more of the logic, and we don't need it for the rest
of this sequence of changes.
Refs scylladb/scylladb#16454
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030
The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.
It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.
Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.
The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2) "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated
The tests are updated to wait for the second message.
Fixes https://github.com/scylladb/scylladb/issues/26004Closesscylladb/scylladb#26392
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.
Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.
This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.
Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).
Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.
Fixes#25359Fixes#26453Closesscylladb/scylladb#26186
* github.com:scylladb/scylladb:
docs::dev::object_storage: Add some initial info on GS storage
docs/dev: Add mention of (nested) docker usage in testing.md
sstables::object_storage_client: Forward memory limit semaphore to GS instance
utils::gcp::object_storage: Add optional memory limits to up/download
sstables::object_storage_client: Add multi-upload support for GS
utils::gcp::storage: Add merge objects operation
test_backup/test_basic: Make tests multiplex both s3 and gs backends
test::cluster::conftest: Add support for multiple object storage backends
boost::gcs_storage_test: reindent
boost::gcs_storage_test: Convert to use fixture
tests::boost: Add GS object storage cases to mirror S3 ones
tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
sstables::object_storage_client: Add google storage implementation
test_services: Allow testing with GS object storage parameters
utils::gcp::gcp_credentials: Add option to create uninitialized credentials
utils::gcp::object_storage: Make create_download_source return seekable_data_source
utils::gcp::object_storage: Add defensive copies of string_view params
utils::gcp::object_storage: Add missing retry backoff increate
utils::gcp::object_storage: Add timestamp to object listing
utils::gcp::object_storage: Add paging support to list_objects
object_storage_client: Add object_name wrapper type
utils::gcp::object_storage: Add optional abort_source
utils::rest::client: Add abort_source support
sstables: Use object_storage_client for remote storage
sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
s3::upload_progress: Promote to general util type
storage_options: Abstract s3 to "object_storage" and add gs as option
sstables::file_io_extension: Change "creator" callback to just data_source
utils::io-wrappers: Add ranged data_source
utils::io-wrappers: Add file wrapper type for seekable_source
utils::seekable_source: Add a seekable IO source type
object_storage_endpoint_param: Add gs storage as option
config: break out object_storage_endpoint_param preparing for multi storage
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.
We fix this issue in this commit by considering topology members
instead.
We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.
We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.
We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
effort to minimize external dependencies in Scylla.
llvm recently updated [1] their coroutine debugging instructions.
They now recommend looking up the variable __coro_frame in the coroutine
function rather than constructing the name of the coroutine frame type
from the ramp function plus __coro_frame_ty.
Since the latter method no longer works with Clang 21 (I did not check
why), and since the former method is blessed as being more compatible,
switch to the recommended method. Since it works with both Clang 20 and
Clang 21, it future proofs the script.
[1] 6e784afcb5Closesscylladb/scylladb#26590
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.
Fixes#25913
Issue probably started after 9d3755f276, so backport to 2025.4
Closesscylladb/scylladb#26593
* github.com:scylladb/scylladb:
compaction: fix use after free when strategy is altered during compaction
compaction/twcs: pass compaction_strategy_state to internal methods
compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
* tools/cqlsh ff3f572...f852b1f5 (2):
> Add LZ4 as a required package - so ScyllaDB Python driver could use LZ4 compression
> github actions: replace macos-13 with macos-15-intel
Closesscylladb/scylladb#26608
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.
Fixesscylladb/scylladb#26567Closesscylladb/scylladb#26568
This patch series introduces several tests that check number of exceptions that happens during various replica operations. The goal is to have a set of tests that can catch situations where number of exceptions per operation increases. It makes exception throw regressions easier to catch.
The tests cover apply counter update and apply functionalities in the database layer.
There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths.
Fixes#18164
This is an improvement. No backport needed.
Closesscylladb/scylladb#25992
* github.com:scylladb/scylladb:
test: cluster: test replica write timeout
database: parameterize apply_counter_update_delay_5s injector value
test: cluster: test replica exceptions - test rate limit exceptions
This PR adds operation per-table histograms to Alternator with item sizes involved in an operation, for each of the operations: `GetItem`, `PutItem`, `DeleteItem`, `UpdateItem`, `BatchGetItem`, `BatchWriteItem`. If read-before-write wasn't performed (i.e. it was not needed by the operation and the flag `alternator_force_read_before_write` was disabled), then we log sizes of the items that are in the request. Also, `UpdateItem` logs the maximum of the update size and the existing item size. We'll change it in a next PR.
Fixes: #25143Closesscylladb/scylladb#25529
* github.com:scylladb/scylladb:
alternator: Add UpdateItem and BatchWriteItem response size metrics
alternator: Add PutItem and DeleteItem response size metrics
alternator: Add BatchGetItem response size metrics
alternator: Add GetItem response size metrics
alternator/test: Add more context to test_metrics.py asserts
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
Fixes#25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.
This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.
Refs #26546
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This patch introduces test `test_replica_database_apply_timeout`.
It tests timeout on database write. The test uses error injection
that returns timeout error if the injection `database_apply_force_timeout`
is enabled.
Refs #18164
Parameterize `apply_counter_update_delay_5s` injector value. Instead of
sleeping 5s when the injection is active, read parameter value that
specifies sleep duration. To reflect these changes, it is renamed to
`apply_counter_update_delay_ms` and the sleep duration is specified in
milliseconds.
Refs #18164
This patch introduces two tests for `replica::rate_limit_exception`.
One test is for write/apply limit, the other one for read/query limit.
The tests check the number of rate limit errors reported and the
number of cpp exceptions reported. If somebody adds an exception
throw on the rate limit paths, this test will catch it and fail.
Refs #18164
This PR contains various improvements in the recovery procedure
tests, mostly `test_raft_recovery_user_data`:
- decreasing the running time,
- some simplifications,
- making sure group 0 majority is lost when expected.
These are not critical test changes, so no need to backport.
Closesscylladb/scylladb#26442
* github.com:scylladb/scylladb:
test: assert that majority is lost in some tests of the recovery procedure
test: rest_client: add timeout support for read_barrier
test: test_raft_recovery_user_data: lose majority when killing one dc
test: test_raft_recovery_user_data: shutdown driver sessions
test: test_raft_recovery_user_data: use a separate driver connection for the write workload
test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node
test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s
test: test_raft_recovery_user_data: speed up replace operations
test: stop/start servers concurrently in the recovery procedure tests
This is a minor refactoring aimed at reducing cognitive complexity of
`update_item_operation::apply`. The logic remains unchanged.
Closesscylladb/scylladb#25887
Clarified and expanded the documentation for the nodetool getendpoints command,
including detailed explanations of the --key and --key-components options.
Added examples demonstrating usage with simple and composite partition keys.
Closesscylladb/scylladb#26529
The test process like that:
- run long dns refresh process
- request for the resolve hostname with short abort_source timer - result
should be empty list, because of aborted request
The test sometimes finishes long dns refresh before abort_source fired and the
result list is not empty.
There are two issues. First, as.reset() changes the abort_source timeout. The
patch adds a get() method to the abort_source_timeout class, so there is no
change in the abort_source timeout. Second, a sleep could be not reliable. The
patch changes the long sleep inside a dns refresh lambda into
condition_variable handling, to properly signal the end of the dns refresh
process.
Fixes: #26561
Fixes: VECTOR-268
It needs to be backported to 2025.4
Closesscylladb/scylladb#26566
In the new API the biggest change is to implement the only
data_sink_impl::put(span<temporary_buffer>) overload.
Encrypted file impl and sstables compress sink use fallback_put() helper
that generates a chain of continuations each holding a buffer.
The counting_data_sink in transport had mostly been patched to correct
implementation by the previous patch, the change here is to replace
vector argument with span one.
Most other sinks just re-implement their put(vector<temporary_buffer>)
overload by iterating over span and non-preemptively grabbing buffers
from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The former helper is implemented like this:
future<> invoke_with_counting(fn) {
if (not_needed)
return fn();
return futurize_invoke(something).then([fn] {
return fn()
}).finally(something_else);
}
and all put() overloads are like
future<> put(arg) {
return invoke_with_counting([this, arg] {
return lower_sink.put(arg);
});
}
The problem is that with seastar API level 9, the put() overload will
have to move the passed buffers into stable storage before preempting.
In its current implementation, when counting is needed the
invoke_with_counting will link lower_sink.put() invocation to the
futurize_invoke(something) future. Despite "something" is
non-preempting, and futurize_invoke() on it returns ready future, in
debug mode ready_future.then() does preempt, and the API level 9 put()
contract will be violated.
To facilitate the switch to new API level, this patch rewrites one of
put() overloads to look like
future<> put(arg) {
if (not_needed) {
return lower_sink.put(arg);
}
something;
return lower_sink(arg).finally(something_else);
}
Other put()-s will be removed by next patch anyway, but this put() will
be patched and will call lower_sink.put() without preemption.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API to put scattered_message into output_stream() is gone in seastar
API level 9, transport is the only place in Scylla that still uses it.
The change is to put the response as a sequence of temporary_buffer-s.
This preserves the zero-copy-ness of the reply, but needs few things to
care about.
First, the response header frame needs to be put as zero-copy buffer
too. Despite output_stream() supports semi-mixed mode, where z.c.
buffers can follow the buffered writes, it won't apply here. The socket
is flushed() in batched mode, so even if the first reply populates the
stream with data and flushes it, the next response may happen to start
putting the header frame before delayed flush took place.
Second, because socket is flushed in batch-flush poller, the temporary
buffers that are put into it must hold the foreigh_ptr with the response
object. With scattered message this was implemented with the help of a
delter that was attached to the message, now the deleter is shared
between all buffers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's going to be removed by next-after-next patch, but the next one
needs this overload implemented properly, so here it is.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The series adds an experimental flag for strongly consistent tables and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.
Closesscylladb/scylladb#26116
* github.com:scylladb/scylladb:
schema: Allow configuring consistency setting for a keyspace
db: experimental consistent-tablets option
This patchset improves the atomicity and clarity of schema application in
the presence of token metadata updates during schema changes. The primary
focus is to ensure that changes to tablet metadata are applied atomically
as part of the schema commit phase, rather than being replicated to all
cores afterward, which previously violated atomicity guarantees.
Key changes:
- Introduced pending_token_metadata to unify handling of new and existing metadata.
- Split token metadata replication into prepare and commit steps.
- Abstracted schema dependencies in storage_service to support pending schema visibility.
- Applied tablet metadata updates atomically within schema commit phase.
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/24414Closesscylladb/scylladb#25302
* github.com:scylladb/scylladb:
db: schema_applier: update tablet metadata atomically
db: replica: move tables_metadata locking to commit
storage_service: abstract schema dependecies during token metadata update
storage_service: split replicate_to_all_cores to steps
db: schema_applier: unify token_metadata loading
replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
service: fix dependencies during migration_manager startup
db: schema_applier: move pending_token_metadata to locator
db: always use _tablet_hint as condition for tablet metadata change
db: refactor new_token_metadata into pending_token_metadata
db: rename new_token_metadata to pending_token_metadata
db: schema_applier: move types storage init to merge_types func
db: schema_applier: make merge functions non-static members
db: remove unused proxy from create_keyspace_metadata
This commit bundle introduces metrics on item sizes for Alternator operations.
The new metrics are:
- `operation_size_kib op=UpdateItem`: Tracks the size of an `UpdateItem`
operation. This is calculated as the sum of the existing item's size
plus the estimated size of the updated fields.
- `operation_size_kib op=BatchWriteItem`: Tracks the total size of items
within a `BatchWriteItem` request, aggregated on a per-table basis. If
an item already exists, the logged size is the maximum of the old and
the new item size.
NOTE: Both metrics rely on read-before-write, so if the
`alternator_force_read_before_write` option is disabled, these metrics
may be incomplete and report inaccurate sizes.
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds `operation_size_kb`
histograms for sizes of items created or replaced by the `PutItem`
operation, and sizes of items deleted by `DeleteItem` requests. The
latter needs a read-before-write, so the metrics may be incomplete if
`alternator_force_read_before_write` is disabled.
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds a `operation_size_kb`
per-table histogram, which contains item sizes in BatchGetItem requests.
A size of a BatchGetItem is the sum of the sizes of all items in the
operation grouped by table. In other words, a single BatchGetItem, and
BatchWriteItem for that matter, updates the histograms for each table
that it has items in.
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds a per-table
`operation_size_kb` histogram, recording the sizes of the items
contained in GetItem responses.
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
We fix this issue and add a regression test in this PR.
Fixes#26569
This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).
Closesscylladb/scylladb#26572
* github.com:scylladb/scylladb:
test: test_raft_recovery_entry_loss: fix the typo in the test case name
test: verify that schema pulls are disabled in the Raft-based recovery procedure
raft topology: disable schema pulls in the Raft-based recovery procedure
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.
The test test_mv_tablets_replace verifies that merging tablets of both a
view and its base table is allowed if rf-rack-valid-keyspaces option is
enabled (and it is enabled by default in the test suite).
The `allow_tablet_merge_with_views` error injection was previously used
to allow merging tablets in a table which has materialized views
attached to it. Now, the error injection is not needed because this is
allowed under the rf-rack-valid condition, which is enabled by default
in tests.
Remove the error injection from the code and adjust the tests not to use
it.
Tablet merge of base tables is only safe if there is at most one replica
in each rack. For more details on why it is the case please see
scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on,
this condition is satisfied, so allow it in that case.
Fixes: scylladb/scylladb#26273
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable. To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
The C++ test `test_indexing_paging_and_aggregation` is one of the slowest tests in test/boost. The reason for its slowness is that it needs a table with more rows than SELECT's "DEFAULT_COUNT_PAGE_SIZE" which was hard-coded to 10,000, so the test needed to write and read tens of thousands of rows, and did it multiple times.
It turns out the code actually had an ad-hoc mechanism to override DEFAULT_COUNT_PAGE_SIZE in a C++ test, but both this mechanism and the test itself were so opaque I didn't find it until I fixed it in a different way: What I ended up doing in this pull request is the following (each step in a separate patch):
1. Rewrite this test in Python, in the test/cqlpy framework. This was straightforward, as this test only used CQL and not internal interfaces. The reason why this test wasn't written in Python in the first place is that it was written in 2019, a year before cqlpy existed. A added extensive comments to the new tests, and I finally understood what it was doing :-)
2. I replaced the ad-hoc C++-test-only mechanism of overriding DEFAULT_COUNT_PAGE_SIZE by a bona-fide configuration parameter, `select_internal_page_size`.
3. Finally, the Python test can temporarily lower `select_internal_page_size` and use a table with much fewer rows.
After this series, the test `test_indexing_paging_and_aggregation` (which is now in Python instead of C++) takes around half a second, 20 times faster than before. I expect the speedup to be even more dramatic for the debug build.
Closesscylladb/scylladb#25368
* github.com:scylladb/scylladb:
cql: make SELECT's "internal page size" configurable
secondary index: translate test_indexing_paging_and_aggregation to Python
Before mutable_token_metadata_ptr containing tablet changes
was replicated to all cores in post_commit phase which violated
atomicy guarantee of schema_applier, now it's incorporated into
per shard commit phase.
It uses service::schema_getter abstraction introduced in earlier
commit to inject "pending" schema which is not yet visible to the
whole system.
The functions prepare_token_metadata_change and commit_token_metadata_change depend
on the current schema through calls to the database service. However, during an
atomic schema change, the current schema does not yet include the pending changes.
Despite that, we want to apply token metadata changes to those pending schema
elements as well.
Currently, this is achieved by postponing token metadata changes until after the rest
of the schema is committed, but this breaks atomicity. To allow incorporating the
prepare and commit phases into schema_applier, we need to abstract the schema
dependency. This will make it possible to provide, in following commits, an
implementation that includes visibility into pending changes, not just the currently
active schema.
Make use of the freshly introduced facility to disable
garbage-collection on a per-query basis for range scans. This is needed
so partitions that only contain garbage-collectible data are not missing
from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(),
the user is expecting to see *all* data, even that which is dead and
garbage-collectible.
Include a test which reproduces the issue.
Allow disabling tombstone gc on a per-query basis for mutation queries.
This is achieved by a bool flag passed to mutation query variants like
`query_mutations_on_all_shards()` and `database::mutation_query()`,
which is then propagated down to compaction_mutation_state.
The future user (in the next patch) is the SELECT * FROM
MUTATION_FRAGMENTS() statement which wants to see dead partitions
(and rows) when scanning a table. Currently, due to garbage collections,
said statement can miss partitions which only contain
garbage-collectible tombstones.
It is confusing. For query compaction, it initialized to `always_gc`,
for sstable compaction it is initialized to a lambda calling into
`can_gc()`. This makes understanding the purpose of this member very
confusing.
The real use of this member is to bridge
mutation_partition::compact_and_expire() with can_gc(). This patch
ditches the member and creates the lambda near the call sites instead,
just like the other params to `compact_and_expire()` already are.
can_gc() now also respects _tombstone_gc.is_gc_enabled() instead of just
blindly returning true when in query mode.
With this patch, whether tombstones are collected or not in query mode
is now consistent and controlled by the tombstone_gc_state.
Currently, to disable tombstone-gc on-demand completely, one has to pass
down a bool flag along with the already required tombstone_gc_state to
the code which does the compacting.
This is redundant and confusing, the tombstone_gc_state is supposed to
encapsulate all tombstone-gc related logic in a transparent way.
Add dedicated factory methods for no-gc and gc-all, to allow creating a
tombstone_gc_state which transparently gcs for all or no tombstones.
This commit adds tests to `test_streams.py` (i.e. Alternator Streams)
checking the following cases:
* putting an item with BatchWriteItem shouldn't emit a log if the old
item and the new item are identical,
* deleting an item with BatchWriteItem shouldn't emit a log if the item
doesn't exist,
* UpdateItem shouldn't emit a log if the old item and the new item are
identical.
These cases haven't been tested until this commit.
Refs https://github.com/scylladb/scylladb/issues/6918Closesscylladb/scylladb#26396
* seastar 270476e7...bd74b3fa (20):
> memory: Decay large allocation warning threshold
> iotune: fix very long warm up duration on systems with high cpu count
> Add lib info to one line backtrace
> io: Count and export number of AIO retries
> io_queue: Destroy priority class data with scheduling group
> Merge 'Expell net::packet from output_stream API stack' from Pavel Emelyanov
code: Introduce new API level
iostream: Remove write()-s of packet/scattered_message from new API level
iostream: Convert output_stream::_zc_bufs to vector of buffers
code: Add data_sink_impl::put(std::span<temporary_buffer>) method
code: Prepare some data_sink_impl::do_put(temporary_buffer) methods
iostream: Introduce output_stream::write(span<temporary_buffer>) overload
packet: Add packet(std::span<temporary_buffer>) constructor
temporary_buffer: Add detach_front() helper
> cooking: update gnutls to 3.7.11
> file: Configure DMA alignment from block size
> util: adapt to fmt 12.0.0 API changes
> Merge 'Internalize reactor::posix_... API methods' from Pavel Emelyanov
reactor: Deprecate and internalize posix_connect()
reactor: Deprecate and internalize posix_listen()
> cooking: update fmt to modern version
> Merge 'Add prometheus bench, coroutinize prometheus' from Travis Downs
prometheus: coroutinize metrics writing
prometheus_test: add global label test
introduce metrics_perf bench
> operator co_await: use rvalue reference
> futurize::invoke: use std::invoke
> io_tester: Don't skip 0 position in sequential workflows
> io_queue: Use own logger for messages
> .clangd: tell the LSP about seastar's header style
> docker: Update to plucky
> Merge 'Convert timer test into seastar test (and a bit more)' from Pavel Emelyanov
test: Remove OK macro
test: Fix one failure check
test: Use boost checkers instead of BUG() macro
test: Fix indentation after previous patch
test: Convert timer_test into seastar test(s)
Closesscylladb/scylladb#26560
In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or
secondary index, it needs to perform internal scans. It uses an "internal
page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000.
There was an ad-hoc and undocumented way to override this default in C++
tests, using functions in test/lib/select_statement_utils.hh, but it
was so non-obvious that the test that most needed to override this
default - the very slow test test_indexing_paging_and_aggregation which
would have been must faster with a lower setting - never used it.
So in this patch we replace the ad-hoc configuration functions by a
bona-fide Scylla configuration option named "select_internal_page_size".
The few C++ tests that used the old configuration functions were
modified to use the new configuration parameters. The slow test
test_indexing_paging_and_aggregation still doesn't use the new
configuration to become faster - we'll do this in the next patch.
Another benefit of having this "internal page size" as a configuration
option is that one day a user might realize that the default choice
10,000 is bad for some reason (which I can't envision right now), so
having it configurable might come it handy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
to add a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.
Fixes#26569
The Boost test test_indexing_paging_and_aggregation is one of the slowest
boost tests. But it's hard to understand why it needs to be so slow - the
C++ test code is opaque, and uncommented. The test didn't need to be in
C++ - it only uses CQL, not any internal interfaces - but it was written
in 2019, a year before test/cqlpy was created.
So before we can make this test faster, this patch translates it to
Python and adds significant amount of comments. The new Python test is
functionally identical to the old C++ test - it is not (yet) made
smaller or faster. The new test takes a whopping 9 seconds to run on
my laptop (in dev build mode). We'll reduce that in the next patch.
As usual, the cqlpy test can also be tested on Cassandra, and
unsurprisingly, it passes.
Refs #16134 (which asks to translate more MV and SI tests to Python).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a struct `per_request_options` used to communicate between CDC and upper abstraction layers. We need this for better compatibility with DynamoDB Streams in Alternator (https://github.com/scylladb/scylladb/issues/6918) to change operation types of log rows. This patch also adds a way to conditionally forward the item read by LWT to CDC and use it as a preimage. For now, only Alternator uses this feature.
The main changes are:
- add a struct `cdc::per_request_options` to pass information between CDC and upper abstraction layers,
- add the struct to `cas_request::apply`'s signature,
- add a possibility to provide a preimage fetched by an upper abstraction layer (to propagate a row read by Alternator to CDC's preimage). This reduces the number of reads-before-write by 1 for some **Alternator** requests and it is always safe. It's possible to use this feature also in CQL.
No backport, it's a feature.
Refs https://github.com/scylladb/scylladb/issues/6918
Refs https://github.com/scylladb/scylladb/pull/26121Closesscylladb/scylladb#26149
* github.com:scylladb/scylladb:
alternator, cdc: Re-use the row read by LWT as a CDC preimage
cdc: Support prefetched preimages
storage: Add cdc options to cas_request::apply
cdc, storage: Add a struct to pass per-mutation options to CDC
cdc: Move operations enum to the top of the namespace
Tiny code cleanup to improve readability without changing behavior.
Changes:
- remove unused variables and imports,
- remove redundant whitespaces, and a duplicated `public:` access
specifier,
- use `is_aws` function to check if running in AWS
test/alternator/test_metrics.py,
- other trivial changes.
Closesscylladb/scylladb#26423
Unfortunately, the test became flaky and is blocking promotion. The
cause of the flaky is not known yet but unrelated to other items
currently queued on the `next` branch. The investigation continues on
GitHub issue scylladb/scylladb#26534.
In the meantime, skip the test to unblock other work.
Refs: scylladb/scylladb#26534Closesscylladb/scylladb#26549
Into total and live. Currently only live (those with live content) are
counted. Report live and total seprately, just like we do for rows. This
allows deducing the count of dead partitions as well, which is
particularly interesting for scans.
Closesscylladb/scylladb#26548
We need to avoid reloading schema early as it goes via
schema_applier which internally depends on storage_service
and on distribued_loader initializing all keyspaces.
Simply moving migration manager startup later in the code is not
easy as some services depend on it being initialized so we just
enable those feature listeners a bit later.
It never belonged to tables and views and its placement stems
from location of _tablet_hint handling code.
In the follwing commits we'll reference it in storage_service.cc.
It prepares pending_token_metadata to handle both new and copy
of existing metadata for consistent usage in later commit.
It also adds shared_token_metatada getter so that we don't
need to get it from db.
This is mechanical change which simplifies the code. Schema_applier
class is an object which holds schema merging intermediate state
so it's fine that all schema merging functions have access to this state.
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.
Closesscylladb/scylladb#26466
Propagates the row read by CAS to CDC's preimage to save one
read-before-write.
As of now, a preimage in Alternator Streams always contains the entire
item (see previous_item_read_command in executor.cc), so the resulting
preimage should stay the same. In other words, this change should be
transparent to users.
This commit adds support to pass a preimage selected by an upper layer
to CDC. The responsibility for the correctness of the preimage (i.e. the
selected columns, whether it's up to date, etc.) lies with the caller.
It may be improved in the future by validating the preimage, e.g. by
"slicing" the received preimage to the necessary columns.
The motivation behind this change was to reduce the number of
read-before-writes and avoid reading the row twice for Alternator
Streams in an increased compatibility mode with DynamoDB. This is to be
added in a following commit. Until now, this commit should be a no-op.
During Scylla startup, directories are created and verified in
`directories::do_verify_owner_and_mode()`. It is possible that while
retrieving file stats, a file might be removed, leading to Scylla
failing to boot.
This is particularly visible in `storage/test_out_of_space.py` tests,
which use FUSE to mount size-limited volumes. When a file that is open
by another process is removed, FUSE renames it to `.fuse_hidden*`.
In `directories::do_verify_owner_and_mode()`, the code performs a
`scan_dir` to list files and retrieves their stats to verify type, mode,
and ownership. If a file is removed while retrieving its stats, we see
errors such as:
```
Failed to get /scylladir/testlog/x86_64/dev/volumes/e0125c60-1e63-4330-bf6f-c0ea3e466919/scylla-0/hints/1/.fuse_hidden0000001800000005
```
This change makes `do_verify_owner_and_mode()` ignore files when
retrieving stats fails, avoiding spurious errors during verification.
Refs: https://github.com/scylladb/scylladb/issues/26314Closesscylladb/scylladb#26535
Not supported currently as such tables have no memtables, cache or
sstables, so any select * from mutation_fragments() query will return
empty result.
Detect virtual tables and add return their content with a distinct
'virtual-table' mutation_source designation.
Add new --input-format command line argument. Possible values are json
(current) and cql (new -- added in this patch).
When --input-format=cql (new default), the input-file is expected to
contain CQL INSERT, UPDATE or DELETE statements, separated by semicolon.
The input file can contain any number of statements, in any order. The
statements will be executed and applied to a memtable, which is then
flushed to create an sstable with the content generated from the
statement. The memtable's size is capped at 1MiB, if it reaches this
size, it is flushed and recreated. Consequently, multiple sstables can
be created from a single scylla-sstable write --input-format=cql
operation.
We include more relevant information for debugging purposes:
the remaining bytes and the size. It might be useful to determine
where exactly an error occurred and help reason about it.
Closesscylladb/scylladb#26486
This patch makes three small mostly-cosmetic improvements to a test in
test/alternator/test_streams.py:
1. The test is renamed "test_streams_deleteitem_old_image_no_ck" to
emphasize its focus on the combination of deleteitem, old image,
and no ck. The "putitem" we had in the name was not relevant, and
the "old_image" was missing and important.
2. Moreover, using PutItem in this test just to set up the test scenario
mixed the bug which the test tries to reproduced with a different
only-recently-fixed bug (that PutItem also generated a spurious
"REMOVE" event). So I changed the use of PutItem by using UpdateItem,
to make this test indepedent of the other bug. Test independence is
important because it allows us - if we want - to backport a fix for
just one bug independently of the fix to the other bug.
3. Also improved the comment in front of the test to mention where we
already tested the with-ck case, and also to mention issue 26382
which this test reproduces (the xfail line also mentions it, but
the xfail line will be removed when the bug is fixed - but the
mention in the comment will remain - and should remain.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26526
This series of patches improves test vector_store_client_test stability. The primary issue with flaky connections was discovered while working on PR #26308.
Key Changes:
- Fixes premature connection closures in the mock server:
The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly.
- Removes a retry workaround:
With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed.
- Mocks the DNS resolver in tests:
The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests.
- Corrects request timeout handling:
A bug has been fixed where the request timeout was not being reset between consecutive requests.
- Unifies test timeouts:
Timeouts have been standardized across the test suite for consistency.
Fixes: #26468
It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches.
Closesscylladb/scylladb#26374
* github.com:scylladb/scylladb:
vector_search: Unify test timeouts
vector_search: Fix missing timeout reset
vector_search: Refactor ANN request test
vector_search: Fix flaky connection in tests
vector_search: Fix flaky test by mocking DNS queries
In the next patches a new input-format will be introduced, which can
produce multiple output format. To prepare for this, consolidate the
code which produces an sstable into a reusable lambda function.
Moves code around, reduces churn in next patches. Indentation is left
broken for easier review.
Make error messages more generic, so they are not specific to select.
Make it a template on the type of cql statement for the final check. To
avoid templating the whole thing, the function is split into two.
Parametrize the name of the allowed statement types in said check.
Prepares the method to be shared between query operation and write
operation (future change).
While at it, also change query param type to std::string_view to avoid
some copies.
This transformation enables an existing schema to be created as a table
in cql_test_env, to be used to read/write sstables belonging to said
schema.
Extract this into a method, to be shared by a future operation which
will also want to do this.
Augments the object storage document with config options etc for
using GS instead of S3.
TODO: add proper gsutil command line examples for manual managing of
GCP storage.
Adds optional memory semaphore to limit the mem buffer usage in sink/source.
Note that we don't bookkeep exact, to avoid deadlock issues in higher layer.
In upload, we overlease on first buffer put to ensure we can at least fill
the desired 8M of buffers. We try to adjust when going over, but if we
fail, we fail, but at least will initiate upload -> soon release memory.
On next put, we try to grab multiples of 8M again, and so forth. Thus
potentially causing waiting for resources, without ending up not uploading
at least one active sink.
For download (source), we try to get lease for as much as we want to read,
but if we fail, we adjust this down to 256k and download anyway. Since this
will typically be released immediately, we at least don't overrun for long,
and again, avoid fully stopping, throttling rate instead.
Adds an `object_storage` fixture with paramterization to iterate through
's3' and 'gs' backends.
For the former, will instansiate the `s3_server` backend (modified to better
handle being actual temp, function level server).
For the latter, will either give back a frontend if env vars indicating
"real" GS buckets and endpoints are used, or launch a docker image for
fake-gcs-server on a free port.
Please read the comment in the code about the management of server output,
as this is less than optimal atm, but I can't figure out the issue with it.
All returned fixture objects will respond to `address`, `bucket` properties,
as well as be able to create endpoint config objects for scylla.
To avoid having to async wait for creating credentials, allow lazy
init (in actual token renew) of credentials. This is not super
pleasant, since it means any error will be late, but it is required
more or less for the code paths into which we intend to place this.
Since, given the nature of object storage API:s, it is no more complicated to
provide a reasonable implementation of a seekable, limited, interface,
give this back, which in turn means upper layers can provide easy read-only file
interfaces. Hint hint.
Since both are bucket+prefix oriented, we can basically use same
options for both, only distinguished by actual protocol.
Abstract the types and the helper parse etc routines to handle either.
Use "gs" as term for gcs (google compute storage), since this is the
URL scheme used.
Because the concept of pushing reading range does not work for the wrapping
we do (i.e. encryption), there is no point having it here. We need to do
said range handling higher up.
Also, must allow multi-layered wrapping.
Extension of data_source, with the ability to
a.) Seek in any direction, i.e. move backwards. Thus not pure stream.
b.) Read a limited number of bytes.
The very transparent reason for the interface is to have a base
abstraction for providing a read-only file layer for networked
resources.
Moves the config wrapper to own file (to reduce recompilation for modifying)
and refactors to handle extending this parameter to non-s3 endpoint configs.
We add missing `const`-qualifiers wherever possible in the module.
A few smaller changes were included as a bonus.
Backport: not needed. This is a cleanup.
Closesscylladb/scylladb#26485
* github.com:scylladb/scylladb:
index/secondary_index_manager: Take std::span instead of std::vector
index/secondary_index_manager: Add missing const qualifier
index/vector_index: Add missing const qualifiers
cql3/statements/index_prop_defs.cc: Remove unused include
cql3/statements/index_prop_defs.cc: Mark function as TU-local
cql3/statements/index_prop_defs: Mark methods as const-qualified
ZSTD_CDict needs a big contiguous allocation and there's no way around that.
The only thing to do is relax the warning appropriately.
Closesscylladb/scylladb#25393
So tombstones can be purged correctly based on the tombstone gc mode.
Currently if repair-mode is used, tombstones are not purged at all,
which can lead to purged tombstone being re-replicated to replicas which
already purged them via read-repair.
This is not a correctness problem, tombstones are not included in data
query resutl or digest, these purgable tombstone are only a nuissance
for read repair, where they can create extra differences between
replicas. Note that for the read repair to trigger, some difference
other than in purgable tombstones has to exist, because as mentioned
above, these are not included in digets.
Fixes: scylladb/scylladb#24332Closesscylladb/scylladb#26351
Batches that fail on the initial send are retired later, until they
succeed. These retires happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.
Fixes: scylladb/scylladb#25432Closesscylladb/scylladb#26304
It turns out that Boost assertions are thread-unsafe,
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.
Fixesscylladb/scylladb#24982Closesscylladb/scylladb#26472
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.
This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.
The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.
Fixes: https://github.com/scylladb/scylladb/pull/25847Closesscylladb/scylladb#25842
The `vector_store_client_test` could be flaky because the request timeout
was not consistently reset in all code paths. This could lead to a
timeout from a previous operation firing prematurely and failing the
test.
The fix ensures `abort_source_timeout` is reset before each request.
The implementation is also simplified by changing
`abort_source_timeout::reset` that combines the reset and arm
operations into a same invocation.
Refactor the `vector_store_client_test_ann_request` test to use the
`vs_mock_server` class, unifying the structure of the test cases.
This change also removes retry logic that waited for the server to be ready.
This is no longer necessary because the handler now exists for all index names
and consumes the entire request payload, preventing connection closures.
Previously, the server did not handle requests for unconfigured
indexes, which caused the connection to close. This could lead to a
race condition where the client would attempt to reuse a closed
connection.
The vector store mock server was not reading the ANN request body,
which could cause it to prematurely close the connection.
This could lead to a race condition where the client attempts to reuse a
closed connection from its pool, resulting in a flaky test.
The fix is to always read the request body in the mock server.
The `vector_store_client_uri_update_to_invalid` test was flaky because
it performed real DNS lookups, making it dependent on the network
environment.
This commit replaces the live DNS queries with a mock to make the test
hermetic and prevent intermittent failures.
`vector_search_metrics_test` test did not call configure{vs},
as a consequence the test did real DNS queries, which made the test
flaky.
The refreshes counter increment has been moved before the call to the resolver.
In tests, the resolver is mocked leading to lack of increments in production code.
Without this change, there is no way to test DNS counter increments.
The change also simplifies the test making it more readable.
Expecting the group 0 read barrier to succeed with a timeout of 1s, just
after restarting 3 out of 5 voters, turned out to be flaky. In some
unlikely scenarios, such as multiple vote splits, the Raft leader
election could finish after the read barrier times out.
To deflake the test, we increase the timeout of Raft operations back to
300s for read barriers we expect to succeed.
Fixes#26457Closesscylladb/scylladb#26489
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.
Before:
- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
After:
- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
Fixes#26503Closesscylladb/scylladb#26504
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.
Fix that by making the test wait for CQL availability via `get_ready_cql()`.
Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()` too.
Fixesscylladb/scylladb#25362Closesscylladb/scylladb#25366
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.
This method need to be run only once during the startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debugability of any further issues
related to it (like https://github.com/scylladb/scylladb/issues/26403).
Fixes https://github.com/scylladb/scylladb/issues/26417
The patch should be backported to 2025.4
Closesscylladb/scylladb#26446
* github.com:scylladb/scylladb:
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
db/view/view_building_worker: futurize and rename `start_background_fibers()`
There was a race between loop in `view_building_worker::run_view_building_state_observer()`
and a moment when a batch was finishing its work (`.finally()` callback
in `view_building_worker::batch::start()`).
State observer waits on `_vb_state_machine.event` CV and when it's
awoken, it takes group0 read apply mutex and updates its state. While
updating the state, the observer looks at `batch::state` field and
reacts to it accordingly.
On the other hand, when a batch finishes its work, it sets `state` field
to `batch_state::finished` and does a broadcast on
`_vb_state_machine.event` CV.
So if the batch will execute the callback in `.finally()` while the
observer is updating its state, the observer may miss the event on the
CV and it will never notice that the batch was finished.
This patch fixes this by adding a `some_batch_finished` flag. Even if
the worker won't see an event on the CV, it will notice that the flag
was set and it will do next iteration.
Fixesscylladb/scylladb#26204Closesscylladb/scylladb#26289
In f828fe0d59 ("setup: add the lazytime XFS version") we added the
lazytime mount option to /var/lib/scylla, but it was quickly reverted
(8f5e80e61a) as it caused a regression on CentOS 7.
We reinstate it now with a kernel version check. This will avoid
the lazytime mount option on CentOS 7, which is unsupported anyway.
The lazytime option avoids marking the inode as dirty if it's only for the
purpose of updating mtime/ctime. This won't help much while writing sstables
(since the write also updates extent information), but may help a little
with with commitlog writes, since those are pure overwrites.
It likely won't help with the RWF_NOWAIT violations seen in [1], since
those are likely due to in-memory locking, not flushing dirty inodes
to disk.
Tested with an install to Ubuntu 24.04 LTS followed by a scylla_setup run.
The lazytime option was added the the .mount file and showed up in
the live mount.
[1] https://github.com/scylladb/seastar/issues/2974
Closes scylladb/scylladb#26436
Fixes#26002
This will allow us to communicate with CDC from higher layers. We plan
to use it to reduce the number of read-before-writes with preimages by
passing the row selected in upper layers.
The test uses CQL tracing to check which files were read by a query.
This is flaky if the coordinator and the replica are different shards,
because the Python driver only waits for the coordinator, and not
for replicas, to finish writing their traces.
(So it might happen that the Python driver returns a result
with only coordinator events and no replica events).
Let's just dodge the issue by using --smp=1.
Fixesscylladb/scylladb#26432Closesscylladb/scylladb#26434
We noticed during work on scylladb/seastar#2802 that on i7i family
(later proved that it's valid for i4i family as well),
the disks are reporting the physical sector sizes incorrectly
as 512bytes, whilst we proved we can render much better write IOPS with
4096bytes.
This is not the case on AWS i3en family where the reported 512bytes
physical sector size is also the size we can achieve the best write IOPS.
This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#25315
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.
the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixes https://github.com/scylladb/scylladb/issues/25290
backport possible to improve stability
Closesscylladb/scylladb#26180
* github.com:scylladb/scylladb:
service/qos: set long timeout for auth queries on SL cache update
auth: add query_state parameter to query functions
auth: refactor query_all_directly_granted
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.
This method need to be run only once during the startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debugability of any further issues
related to it (like scylladb/scylladb#26403).
Fixesscylladb/scylladb#26417
Next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we need to futurize
`start_background_fibers()` method and change its name to better reflect
its purpose.
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411
Before `sl:driver` was introduced, service levels were assigned as
follows:
1. New connections were processed in `main`.
2. After user authentication was completed, the connection's SL was
changed to the user's SL (or `sl:default` if the user had no SL).
This commit introduces `service_level_state` to `client_state` and
implements the following logic in `transport/server`:
1. If `sl:driver` is not present in the system (for example, it was
removed), service levels behave as described above.
2. If `sl:driver` is present, the flow is:
I. New connections use `sl:driver`.
II. After user authentication is completed, the connection's SL is
changed to the user's SL (or `sl:default`).
III. If a REGISTER (to events) request is handled, the client is
processing the control connection. We mark the client_state
to permanently use `sl:driver`.
The aforementioned state `2.III` is represented by
`_control_connection` flag in `client_state`.
Fixes: scylladb/scylladb#24411
Before this change, unauthorized connections stayed in `main`
scheduling group. It is not ideal, in such case, rather `sl:default`
should be used, to have a consistent behavior with a scenario
where users is authenticated but there is no service level assigned
to the user.
This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.
Fixes: scylladb/scylladb#26040
Before this change, new connections were handled in a default
scheduling group (`main`), because before the user is authenticated
we do not know which service level should be used. With the new
`sl:driver` service level, creation of new connections can be moved to
`sl:driver`.
We switch the service level as early as possible, in `do_accepts`.
There is a possibility, that `sl:driver` will not exist yet, for
instance, in specific upgrade cases, or if it was removed. Therefore,
we also switch to `sl:driver` after a connection is accepted.
Refs: scylladb/scylladb#24411
Driver service level is a special service level that is created
automatically by the system. Therefore, it requires special handling
in DESC SCHEMA WITH INTERNALS and those test verifies the special
behavior.
Refs: scylladb/scylladb#24411
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is ensuring existence of `sl:driver` in new system and tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
This commit implements `get_create_driver_service_level_mutations`
and `migrate_to_driver_service_level` in service_level_controller.
Both methods create `sl:driver` with shares=200 and store this fact
in `system.scylla_local`. Both methods will be used later in this
patch series for automatic creation of sl:driver.
Refs: scylladb/scylladb#24411
Later in this patch series, `sl:driver` will be added as a special
service level created automatically by the system. It needs special
handling in `DESC SCHEMA ...` to ensure that during backup restore:
1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists
2. If `sl:driver` exists, its configuration is fully restored (emit
ALTER SERVICE LEVEL).
3. If `sl:driver` was removed, the information is retained (emit
DROP SERVICE LEVEL instead of CREATE/ALTER).
Refs: scylladb/scylladb#24411
This adds a reference to sl_controller so that, later in this patch
series, topology_coordinator can manage creating `sl:driver` once
group0 is fully operational.
Refs: scylladb/scylladb#24411
This commit extends sytem.scylla_local table with an additional
key/value pair that can be used later in this patch series to
keep an information that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.
A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in state machine shapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.
Refs: scylladb/scylladb#24411
Previously, tests used the hardcoded value 7 for the maximum number of
user service levels. This commit introduces a named variable that can
be shared across tests to avoid cases where this magic number goes
out of sync.
The voter handler caused `test_raft_recovery_user_data` to stop losing
group 0 majority when expected. We make sure this won't happen again
in this commit.
We don't change `test_raft_recovery_entry_lose` because it has some
checks that would fail with group 0 majority (schema versions would
match).
Note that it's possible to timeout the read barrier quickly without the
`timeout` parameter. See e.g. `test_cannot_add_new_node` in
`test_raft_no_quorum.py`. We don't take this approach here because we
don't want to change the default Raft parameters in the recovery
procedure tests.
After introducing the voter handler, the test stopped losing group 0
majority when expected because the killed dc contained 2 out of 5
voters. We fix it in this commit. The fix relies on the voter handler
not doing unnecessary work. The first dc should keep its voters and
majority.
The test was functional even though majority wasn't lost when expected.
Stopping the recovery leader before restarting it with `recovery_leader`
caused majority loss in the old group 0. Hence, there is no need to
backport this commit.
Shutting down `ccluster_all_nodes` in the previous commit is necessary
to avoid flakiness. It turns out that leaked driver sessions can impact
another run of the test case (with different parameterization). Here,
without shutting down `ccluster_all_nodes`, we could observe the DDL
requests from `start_writes` fail in the second test case run
(where `remove_dead_nodes_with == "replace"`) like this:
```
> await cql.run_async(f"USE {ks_name}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.46.35.70:9042 dc1>:
ConnectionException('Host has been marked down or removed'),
<Host: 127.46.35.71:9042 dc1>: ConnectionException('Host has
been marked down or removed'), <Host: 127.46.35.3:9042 dc1>:
ConnectionException('Host has been marked down or removed'),
<Host: 127.46.35.25:9042>: ConnectionException('Host has
been marked down or removed')})
```
We could also see errors like this on the driver:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query]
message="Keyspace 'test_1759763911381_oktks' does not exist"
```
It turned out that `test_1759763911381_oktks` was created in the first
test case run (where `remove_dead_nodes_with == "remove"), and somehow
the driver session created in the second test case run was still using
this keyspace in some way. The DDL requests were failing on the Scylla
side with the error above, and after some retries, the driver marked
nodes as down. I didn't try to investigate what exactly the driver was
doing.
In this commit, we shut down other driver sessions used in this test.
They didn't cause problems so far, but we'd better use the Python driver
correctly and be safe.
It's simpler than pausing the workload for the `cql` reconnection.
Moreover, the removed `start_writes` call required group 0 majority for
(redundant) CREATE KEYSPACE IF NOT EXISTS and CREATE TABLE IF NOT EXISTS
statements. The test shouldn't have group 0 majority at that point,
which is fixed in one of the following commits.
Using a separate driver connection also allows us to call
`finish_writes()` a bit later, after the `cql` reconnection.
It looks like decreasing `failure_detector_timeout_in_ms` doesn't make
the shutdown faster anymore.
We had some changes related to requests during shutdown like #24499
and #24714. They are probably the reason.
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.
Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.
Fixesscylladb/scylladb#26420Closesscylladb/scylladb#26421
This patch adds tests for:
- tablet migration during view building
- tablet merge during view building.
Those tests were missing from the original testing plan.
We want to backport it to 2025.4 to ensure the release is bug-free.
Closesscylladb/scylladb#26414
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add test for tablet merge
test/cluster/test_view_building_coordinator: add test for tablet migration
Seastar httpd recommended users to stop using contiguous requet.content string and read body they need from request's input_stream instead. However, "official" deprecation of request content had been only made recently.
This PR patches REST API server to turn this feature on and patches few handlers that mess with request bodies to read them from request stream.
Using newer seastar API, no need to backport
Closesscylladb/scylladb#26418
* github.com:scylladb/scylladb:
api: Switch to request content streaming
api: Fix indentation after previous patch
api: Coroutinize set_relabel_config handler
api: Coroutinize set_error_injection handler
This dependency reference is carried into column_family handlers block to make get_built_views handler work. However, the handler in question should live in view_builder block, because it works with v.b. data. This PR moves the handler there, while at it, coroutinizes it, and removes the no longer needed sys.ks. reference from column_family.
API dependencies cleanup work, no need to backport
Closesscylladb/scylladb#26381
* github.com:scylladb/scylladb:
api: Fix indentation after previous patch
api: Coroutinize get_built_indexes handler code
api: Remove system_keyspace ref from column_family API block
api: Move get_built_indexes from column_family to view_builder
If mis-used, the script says
error: unrecognized option: ..., see ./scripts/pull_github_pr.sh -h for usage
but if using the suggested -h option it prints just the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26378
The PR #26154 dropped the `-fvisibility=hidden` compiler flag and
replaced it with `-fvisibility-inlines-hidden` as the former caused
issues in how the `noncopyable_function::operator bool` method executed
leading to incorrect return values. Apply the same fix to cmake.
Fixes#26391Closesscylladb/scylladb#26431
There are three handler that need to be patched all at once with the
server itself being marked with set_content_streaming
For two simple handler just get the content string with
read_entire_stream_contiguous helper. This is what httpd server did
anyway.
The "start_restore" handler used the contiguous contents to parse json
from using rjson utility. This handler is patched to use
read_entire_stream() that returns a vector of temporary buffers. The
rjson parser has a helper to pars from that vector, so the change is
also optimization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Without the invoke_on_all lambda, for simplicity
Also keep indentation "broken" for the ease of review
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.
The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
example, all 3 nodes that first joined the new group 0 are in the same
dc and rack, but only one of them becomes a voter because the voter
handler tries to make non-members in other dcs/racks voters.
Fixes#26321Closesscylladb/scylladb#26327
Some code wants its TLS sockets to close immediately without sending BYE
message and waiting for the response. Recent seastar update changed the
way this functionality is requested (scylladb/seastar#2986)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26253
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.
(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).
Fix that.
Fixesscyllladb/scylladb#26371Closesscylladb/scylladb#26196
BYPASS CACHE is implemented for `bti_index_reader` by
giving it its own private `cached_file` wrappers over
Partitions.db and Rows.db, instead of passing it
the shared `cached_file` owned by the sstable.
But due to an oversight, the private `cached_file`s aren't
constructed on top of the raw Partitions.db and Rows.db
files, but on top of `cached_file_impl` wrappers around
those files. Which means that BYPASS CACHE doesn't
actually do its job.
Tests based on `scylla_index_page_cache_*` metrics
and on CQL tracing still see the reads from the private
files as "cache misses", but those misses are served
from the shared cached files anyway, so the tests don't see
the problem. In this commit we extend `test_bti_index.py`
with a check that looks at reactor's `io_queue` metrics
instead, and catches the problem.
Fixesscylladb/scylladb#26372Closesscylladb/scylladb#26373
This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names.
For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] }
Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs.
Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks.
Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported.
New feature, no backport required.
Co-authored with @bhalevy
Fixes https://github.com/scylladb/scylladb/issues/25269
Fixes https://github.com/scylladb/scylladb/issues/23525Closesscylladb/scylladb#26358
* github.com:scylladb/scylladb:
tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
locator: Make hasher for endpoint_dc_rack globally accessible
test: tablets: Add test for replica allocation on rack list changes
test: lib: topology_builder: generate unique rack names
test: Add tests for rack list RF
doc: Document rack-list replication factor
topology_coordinator: Restore formatting
topology_coordinator: Cancel keyspace alter on broader set of errors
topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
cql3: ks_prop_defs: Preserve old options
cql3: ks_prop_defs: Introduce flattened()
locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()
tablet_allocator: Respect binding replicas to racks
locator: network_topology_strategy: Respect rack list when reallocating tablets
cql3: ks_prop_defs: Fail with more information when options are not in expected format
locator, cql3: Support rack lists in replication options
cql3: Fail early on vnode/tablet flavor alter
cql3: Extract convert_property_map() out of Cql.g
schema: Use definition from the header instead of open-coding it
locator: Abstract obtaining the number of replicas from replication_strategy_config_option
cql3, locator: Use type aliases for option maps
locator: Add debug logging
locator: Pass topology to replication strategy constructor
abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.
After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.
In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.
After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.
High-level implementation strategy:
1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
and add a new one.
5. Update the user documentation.
Fixesscylladb/scylladb#23030
Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.
Closesscylladb/scylladb#25802
* github.com:scylladb/scylladb:
view: Stop requiring experimental feature
db/view: Verify valid configuration for tablet-based views
db/view: Require rf_rack_valid_keyspaces when creating view
test/cluster/random_failures: Skip creating secondary indexes
test/cluster/mv: Mark test_mv_rf_change as skipped
test/cluster: Adjust MV tests to RF-rack-validity
test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
db/view: Name requirement for views with tablets
The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people.
The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear.
Code cleanup, no backport.
Closesscylladb/scylladb#26280
* github.com:scylladb/scylladb:
replica: move querier code to replica namespace
root,replica: mv querier to replica/
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Also, extend a related test so that it would catch the problem before the fix.
Fixesscylladb/scylladb#26393
Bugfix, needs backport to 2025.4.
Closesscylladb/scylladb#26394
* github.com:scylladb/scylladb:
sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
test/boost/database_test: fix two no-op distributed loader tests
The reason for this seastar update is fixing #26190 - a service
level bug caused by a problem in scheduling group in seastar
implementation (seastar#2992).
* ./seastar 9c07020a...270476e7 (10):
> core: restore seastar_logger namespace in try_systemwide_memory_barrier
> Merge 'coroutines: support coroutines that copy their captures into the coroutine frame' from Avi Kivity
coroutines: advertise lambda-capture-by-value and test it
future: invoke continuation functions as temporaries
future: handle lvalue references in future continuations early
> resource: Tune up some allocate_io_queues() arguments
> Merge 'Add perf test hooks' from Travis Downs
perf_tests:add tests to verify pre-run hooks
per_tests: add pre-run hooks
perf-tests.md: update on measurement overhead
perf_tests_perf: a few more test variations
remove vestigial register_test method
> Add `touch` command to `rl` file processing
> Merge 'execution_stage: update stage name on scheduling_group rename' from Andrzej Jackowski
test: add sg_rename_recreate_with_the_same_name
test: add test_renaming_execution_stage in metric_test
test: add test_execution_stage_rename
execution_stage: update stage name on scheduling_group rename
execution_stage: reorganize per_group_stage_type
execution_stage: add concrete_execution_stage_base
execution_stage: move metrics setup to a separate method
> iotune: Fix warmup calculation bug and botched rebase
> Add missing `#pragma once` to ascii.rl
> iotune: Ignore measurements during warmup period
Fixes: https://github.com/scylladb/scylladb/issues/26190Closesscylladb/scylladb#26388
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Fixesscylladb/scylladb#26393
There are two tests which effectively check nothing.
They intend to check that distributed loader removes "leftover" sstable
files. So they create some incomplete sstables, run the test env
on the directory, and the files disappeared.
But the test env completely clears the test directory before
the distributed loader looks at the files, so the tests succeed trivially.
Fix that by adding a config knob to the test env which instructs it
not to clear the directory before the test.
We adjust the documentation to include the new
VECTOR_SEARCH_INDEXING permission and its usage
and also to reflect the changes in the maximal
amount of service levels.
This commit adds tests to verify the expected
behavior of the VECTOR_SEARCH_INDEXING permission,
that is, allowing GRANTing this permission only on
ALL KEYSPACES and allowing SELECT queries only on tables
with vector indexes when the user has this permission
This patch allows users with the VECTOR_SEARCH_INDEXING permission
to perform SELECT queries on tables that have a vector index.
This is needed for the Vector Store service, which
reads the vector-indexed tables, but does not require
the full SELECT permission.
This commit adds a new version of command_desc struct
that contains a set of permissions instead of a singular
permission. When this struct is passed to ensure/check_has_permission,
we check if the user has any of the included permission on the resource.
This patch adds a new permission: VECTOR_SEARCH_INDEXING,
that is grantable only for ALL KEYSPACES. It will allow selecting
from tables with vector search indexes. It is meant to be used
by the Vector Store service to allow it to build indexes without
having full SELECT permissions on the tables.
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.
This is the first change in the size based load balancing series.
Closesscylladb/scylladb#26035
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.
Fixes: scylladb/scylladb#26320
Broken documentation link fix for the tool help output, needs backport to all live source-available versions.
Closesscylladb/scylladb#26322
* github.com:scylladb/scylladb:
tools/scylla-sstable: fix doc links
release: adjust doc_link() for the post source-available world
tools/scylla-nodetool: remove trailing " from doc urls
This reference was only needed to facilitate get_built_indexes handler
to work. Now it's gone and the sys.ks. reference is no longer needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The handler effectively works with the view_builder and should be
registerd in the block that has this service captured.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The old logic assumes that replicas are spread across whole DC when
determining how many tablets we need to have at least 10 tablets per
shard. If replicas are actually confined to a subset of racks, that
will come up with a too high count and overshoot actual per-shard
count in this rack.
Similar problem happens for scaling-down of tablet count, when we try
to keep per-shard tablet count below the goal. It should be tracked
per-rack rather than per-DC, since racks can differ in how loaded they
are by RF if it's a rack-list.
Encode the dc identifier into each rack name so each dc will have its
own unique racks.
Just for easier distinction in logs.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are several problems with how ALTER execution works with tablets.
1) Currently, option processing bypasses
ks_prop_defs::prepare_options(), and will pass them directly to
keyspace_metadata. This deviates from the vnode path, causing
discrepancy in logic. But also there will be some non-trivial options
post-processing added there - numeric RF will be replaced with a
rack list. We should preserve it in the tablet path which alters
the keyspace, otherwise it will fail when trying to construct
network_topology_strategy.
2) Option merging happens on the flat version of the map, which won't
work correctly with extended map which contains lists. We want
the new list to replace the old list or numeric RF, not get its items
merged. For example:
We want:
{'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']}
If we merge flattened options, we would get incorrect flattened options:
{'dc1': 3,
'dc1:0', 'rack1'
'dc1:1', 'rack2'}
3) We lose atomicity of update. Validation and merging which happens on the CQL
coordinator is done in a different group0 transaction context than mutation
generation inside topology coordinator later.
Fixes https://github.com/scylladb/scylladb/issues/25269
In 2d9b8f2, semantics of ALTER was changed for tablet-based keyspaces
which makes "replication" assignment act like +=, where replication
options are merged with the old options.
This merging is currently performed in the CQL statement level on
options map, before passing to topology coordinator. This will change
in later commit, so move merging here. Merging options of flattened
level will not be correct because it doesn't recognize nested
collections, like rack lists.
We want:
{'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']}
If we merge flattened options, we would get incorrect flattened options:
{'dc1': 3,
'dc1:0', 'rack1'
'dc1:1', 'rack2'}
Which cannot be parsed back into ks_prop_defs on the topology coordinator.
Refs https://github.com/scylladb/scylladb/pull/20208#issuecomment-3174728061
Refs #25549
Before, we would throw vague sstring_out_of_range from substr() when
the name doesn't have a nested key separate with ":", e.g "durable_writes"
instead of "durable_writes:durable_writes".
Allows per-DC replication factor to be either a string, holding a
numerical value, or a list of strings, holding a list of rack names.
The rack list is not respected yet by the tablet allocator, this is
achieved in subsequent commit.
This changes the format of options stored in the flattened map
in system_schema.keyspaces#replication. Values which are rack lists,
are converted into multiple entries, with the list index appended to
the key with ':' as the separator:
For example, this extended map:
{
'dc1': '3',
'dc2': ['rack1', 'rack2']
}
is stored as a flattened map:
{
'dc1': '3',
'dc2:0': 'rack1',
'dc2:1': 'rack2'
}
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
When walking the free-list of a pool or a span, the small-object code
casts the dereferenced `free_object*` to `void*`. This is unnecessary,
just use the `next` field of the `free_object` to look up the next free
object. I think this monkey business with `void*` was done to speed up
walking the free-list, but recently we've seen small-object --summarize
fail in CI, and it could be related.
Fixes: #25733Closesscylladb/scylladb#26339
Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key.
This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys.
The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility.
Backport is not required, since this change relies on recent Seastar updates
Fixes#16596Closesscylladb/scylladb#26169
* github.com:scylladb/scylladb:
docs: document --key-components option for getendpoints
test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
nodetool: Introduce new option --key-components to specify compound partition keys as array
rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
rest_api_mock: support duplicate query parameters
test/rest_api: support multiple query values per key in RestApiSession.send()
nodetool: add support of new seastar query_parameters_type to scylla_rest_client
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns ready cql. However, get_all_tablet_replicas
uses the cql reference from manager that isn't ready. Wait for cql.
Fixes: #26328Closesscylladb/scylladb#26349
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).
Fix this by changing the container to by chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.
Fixes#24198.
Closesscylladb/scylladb#26353
Some tests expect this error. Later, prepare_options() will be changed
in a way which would fail to accept new options in such case before
vnode/tablet flavor change is detected, tripping the tests.
It will become more complex when options will contain rack lists.
It's a good change regardless, as it reduces duplication and makes
parsing uniform. We already diverged to use stoi / stol / stoul.
The change in create_keyspace_statement.cc to add a catch clause is
needed because get_replication_factor() now throws
configuration_exception on parsing errors instead of
std::invalid_argument, so the existing catch clause in the outer scope
is not effective. That loop is trying to interpret all options as RF
to run some validations. Not all options are RF, and those are
supposed to be ignored.
In preparation for changing their structure.
1) std::map<sstring, sstring> -> replication_strategy_config_options
Parsed options. Values will become std::variant<sstring, rack_list>
2) std::map<sstring, sstring> -> property_definitions::map_type
Flattened map of options, as stored system tables.
Adds a parameterized test to verify that multiple --key-components arguments
are handled correctly by nodetool's getendpoints command. Ensures the
constructed REST request includes all key_component values in the expected format.
Allows getendpoints to accept components of partition key using the --key-components option.
Key components are passed as an array and sent to the new /natural_endpoints/v2/{keyspace} endpoint.
Adds a test case for the `/storage_service/natural_endpoints/v2/{keyspace}` endpoint,
verifying that it correctly resolves natural endpoints for a composite partition key
passed as multiple `key_component` query parameters.
The original `/storage_service/natural_endpoints` endpoint uses colon-separated strings for composite keys,
which causes ambiguity when key components contained colons.
This commits adds a new `/storage_service/natural_endpoints/v2/{keyspace}` endpoint that accepts partition key components
via repeated `key_component` query parameters to avoid this issue.
Previously, only the last value of a repeated query parameter was captured,
which could cause inaccurate request matching in tests. This update ensures
that all values are preserved by storing duplicates as lists in the `params` dict.
Previously, the send() method in RestApiSession only supported one value per query parameter key.
This patch updates it to support passing lists of values, allowing the same key to appear multiple
times in the query string (e.g. ?key=value1&key=value2).
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The possible benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.
In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
and lazy_comparable_bytes_from_ring_position
so that they performs the key translation eagerly,
all components to a single bytes_ostream in one synchronous call.
perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type):
```
Before:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6
After:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9
```
Enhancement, backport not required.
Closesscylladb/scylladb#26302
* github.com:scylladb/scylladb:
sstables/trie: BTI-translate the entire partition key at once
sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
sstables/trie: perform the BTI-encoding of position_in_partition eagerly
types/comparable_bytes: add comparable_bytes_from_compound
test/perf: add perf_bti_key_translation
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS`
enabled, and using the experimental feature `views-with-tablets`.
We drop the last requirement.
We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.
Fixesscylladb/scylladb#23030
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:
* experimental feature `views-with-tablets`,
* configuration option `rf_rack_vaid_keyspaces`.
Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.
We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
We extend the requirements for being able to create materialized views
and secondary indexes in tablet-based keyspaces. It's now necessary to
enable the configuration option `rf_rack_valid_keyspaces`. This is
a stepping stone towards bringing materialized views and secondary
indexes with tablets out of the experimental phase.
We add a validation test to verify the changes.
Refs scylladb/scylladb#23030
Materialized views are going to require the configuration option
`rf_rack_valid_keyspaces` when being created in tablet-based keyspaces.
Since random-failure tests still haven't been adjusted to work with it,
and because it's not trivial, we skip the cases when we end up creating
or dropping an index.
The test will not work with `rf_rack_valid_keyspaces`. Since the option
is going to become a requirement for using views with tablets, the test
will need to be rewritten to take that into consideration. Since that
adjustment doesn't seem trivial, we mark the test as skipped for the
time being.
Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet.
With enough tablets (10s of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls also when converting those mutation into canonical_mutation and back, as they are similar to frozen_mutation, and bit more expensive since they also save the column mappings.
This series takes a different approach than allowing freeze to yield.
`tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations, that when squashed together are equivalent to the previously large mutation. Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutation to a vector for further processing, and/or process them inline by freezing or making a canonical mutation.
In addition, split the large mutations would also prevent hitting the commitlog maximum mutation size.
Closesscylladb/scylladb#18162
* github.com:scylladb/scylladb:
schema_tables: convert_schema_to_mutations: simplify check for system keyspace
tablets: read_tablet_mutations: use unfreeze_and_split_gently
storage_service: merge_topology_snapshot: freeze snp.mutations gently
mutation: async_utils: add unfreeze_and_split_gently
mutation: add for_each_split_mutation
tablets: tablet_map_to_mutations: maybe split tablets mutation
tablets: tablet_map_to_mutations: accept process_func
perf-tablets: change default tables and tablets-per-table
perf-tablets: abort on unhandled exception
Some of the new tests covering materialized views explicitly disabled
the configuration option `rf_rack_valid_keyspaces`. It's going to become
a new requirement for views with tablets, so we adjust those tests and
enable the option. There is one exception, the test:
`cluster/mv/test_mv_topology_change.py::test_mv_rf_change`
We handle it separately in the following commit.
Currently, the function unfreezes each schema mutation partition
and then checks if it's for a system keyspace.
This isn't really needed since we can check the partition key
using the frozen_mutation, skip it if the partition is for a system keyspace.
Note that the constructed partition_key just copies
the frozen partition_key_view, without copying or deserializing the
actual key contents.
Also, reserve `results` capacity using the queried
partitions' size to prevent reallocations of the results
vector.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't need to store all snp.mutations in a vector
and then freeze the whole vector. They can be frozen
one at a time and collected into a vector, while
maybe yielding between each mutation to prevent
stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unfreeze the frozen_mutation, possibly splitting it
based on max_rows. The process_mutation function
is called for each split mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allows processing of the split mutations one at a time.
This can reduce memory footprint as the caller
won't have to store a vector of the split mutations
and then convert it (e.g. freeze the mutations
or convert them to canonical mutations).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Split the generated tablets mutation if we run out of task
quota to prevent stalls, both when preparing the mutations
and later on when freezing/unfreezing them or converting
them to canonical_mutation and back.
Note that this will convert large mutation to long
vectors of mutations. A followup change is considered
to convert std::vector:s of mutations to chunked_vector
to prevent large allocations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.
This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations or simililarly
convert them into canonical mutations.
Next patch will split large tablet mutations to prevent stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
tablets-per-table must be a power of 2, so round up 10000 to 16K.
also, reduce number of tables to have a total of about 100K
tablets, otherwise we hit the maximum commitlog mutation size
limit in save_tablet_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.
This patch moves 8 files out of utils/:
utils/advanced_rpc_compressor.hh
utils/advanced_rpc_compressor.cc
utils/advanced_rpc_compressor_protocol.hh
utils/stream_compressor.hh
utils/stream_compressor.cc
utils/dict_trainer.cc
utils/dict_trainer.hh
utils/shared_dict.hh
These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.
Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.
Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
"netw::" namespace used by RPC. Also needed to change a bunch
of callers for the new namespace. Also, had to add "utils::"
explicitly in several places which previously assumed the
current namespace is "utils::".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25149
This PR migrates limits tests from dtest to this repository.
One reason is that there is an ongoing effort to migrate tests from dtest to here.
Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.
Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.
scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)
Fixes#25097
This is a migration of existing tests to this repository. No need for backport.
Closesscylladb/scylladb#26077
* github.com:scylladb/scylladb:
test: dtest: limits_test.py: test_max_cells log level
test: dtest: limits_test.py: make the tests work
test: dtest: test_limits.py: remove test that are not being migrated
test: dtest: copy unmodified limits_test.py
Corrected spelling mistakes, typos, and minor wording issues to improve
the developer documentation.
No backport: There is no functional change, and the doc is mostly
relevant to master, so it doesn't need to be backported.
Closesscylladb/scylladb#26332
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.
That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.
To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt the enforce it
everywhere, so let's do that.
Refs scylladb/scylladb#23958
We add a named requirement, a function, for materialized views with tablets.
It decides whether we can create views and secondary indexes in a given
keyspace. It's a stepping stone towards modifying the requirements for it.
This way, we keep the code in one place, so it's not possible to forget
to modify it somewhere. It also makes it more organized and concise.
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.
Refs #25097
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.
Disable it for `debug`, `dev`, and `release` mode.
Refs #25097
The query namespace is used for symbols which span the coordinator and
replica, or that are mostly coordinator side. The querier is mainly in
this namespace due to its similar name, but this is a mistake which
confuses people. Now that the code was moved to replica/, also fix the
namespace to be namespace replica.
The querier object is a confusing one. Based on its name it should be in
the query/ module and it is already in the query namespace. But this is
actually a completely replica-side logic, implementing the caching of
the readers on the replica. Move it to the replica module to make this
more clear.
Delaying the BTI encoding of partition keys is a good idea,
because most of the time they don't have to be encoded.
Usually the token alone is enough for indexing purposes.
But for the translation of the `partition_key` part itself,
there's no good reason to make it lazy,
especially after we made the translation of clustering keys
eager in a previous commit. Let's get rid of the `std::generator`
and convert all cells of the partition key in one go.
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.
In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
so that it performs the translation eagerly,
all components to a single bytes_ostream.
Note: the name *lazy*_comparable_bytes_from_clustering_position
stays, because the interface is still lazy.
perf_bti_key_translation:
Before:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6
After:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9
Add a function which converts compound types (keys and key prefixes)
to BTI encoding.
It's almost the same as the existing `lazy_comparable_bytes_from_compound`
(in bti_key_translation.cc), except it eagerly serializes key components
to a bytes_ostream instead of lazily yielding them from a generator.
We will remove `lazy_comparable_bytes_from_compound` in a later commit.
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.
the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixesscylladb/scylladb#25290
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.
in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.
This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.
@@ -12,7 +12,7 @@ Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to re
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
and other sanity checks. It has no optimizations, which allows for debugging with tools like
@@ -361,7 +361,7 @@ avoid that the gold linker can be told to create an index with
More info at https://gcc.gnu.org/wiki/DebugFission.
Both options can be enable by passing `--split-dwarf` to configure.py.
Both options can be enabled by passing `--split-dwarf` to configure.py.
Note that distcc is *not* compatible with it, but icecream
(https://github.com/icecc/icecream) is.
@@ -370,7 +370,7 @@ Note that distcc is *not* compatible with it, but icecream
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:
One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:
seastar::metrics::make_histogram("batch_item_count_histogram",seastar::metrics::description("Histogram of the number of items in a batch request"),labels,
co_returnapi_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
"summary":"This method returns the N endpoints that are responsible for storing the specified key i.e for replication. the endpoint responsible for this key",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_natural_endpoints_v2",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace to query about.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Column family name.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"key_component",
"description":"Each component of the key for which we need to find the endpoint (e.g. ?key_component=part1&key_component=part2).",
"description":"When set to true, drop unfixable sstables. Applies only to scrub mode SEGREGATE.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -1551,6 +1622,30 @@
}
]
},
{
"path":"/storage_service/exclude_node",
"operations":[
{
"method":"POST",
"summary":"Marks the node as permanently down (excluded).",
"type":"void",
"nickname":"exclude_node",
"produces":[
"application/json"
],
"parameters":[
{
"name":"hosts",
"description":"Comma-separated list of host ids to exclude",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/removal_status",
"operations":[
@@ -2956,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
throwexceptions::invalid_request_exception(format("Cannot create CDC log for a table {}.{}, because the keyspace uses tablets, and not all nodes support the CDC with tablets feature.",
utils::get_local_injector().inject("rest_api_keyspace_scrub_abort",[]{throwcompaction_aborted_exception("","","scrub compaction found invalid data");});
exceptions::invalid_request_exception("'replication_factor' tag is not allowed when executing ALTER KEYSPACE with tablets, please list the DCs explicitly"));
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to perform a schema change that
// would lead to an RF-rack-valid keyspace. Verify that this change does not.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.