This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
Several test cases where introduced to verify expected behaviour.
Backport is not required, it is a new feature
Fixes#20100Closesscylladb/scylladb#27287
* github.com:scylladb/scylladb:
sstable_test: add verification testcases of SSTable components digests persistance
sstables: store digest of all sstable components in scylla metadata
sstables: Add TemporaryScylla metadata component type
sstables: Extract file writer closing logic into separate methods
sstables: Add components_digests to scylla metadata components
sstables: Implement CRC32 digest-only writer
Fixes#27367Fixes#27362Fixes#27366
Makes http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set
Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere.
Closesscylladb/scylladb#27368
* github.com:scylladb/scylladb:
ear::kms/ear::azure: Use utils::http URL parsing
ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
utils::http: Handle ipv6 numeric host part in URL:s
It is observed that:
repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])
This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.
To fix, we can relax the check of (start, end] for the min max range.
Fixes#27220Closesscylladb/scylladb#27357
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces was, by
accident, required for vector indexes, too.
Remove the restriction for vector store indexes as it is completely
unnecessary.
Fixes: SCYLLADB-81
Closesscylladb/scylladb#27447
During b9199e8b24
reivew it was suggested to use standard for loop
but when erasing element it causes increment on
invalid iterator, as role could have been erased
before.
This change brings back original code.
Fixes: https://github.com/scylladb/scylladb/issues/27422
Backport: no, offending commit not released yet
Closesscylladb/scylladb#27444
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
After 39cec4ae45 node join may fail with either "request canceled" notification or (very rarely) because it was banned. Depend on timing. The series fixes the test to check for both possibilities.
Fixes#27320
No need to backport since the flakiness is in the mater only.
Closesscylladb/scylladb#27408
* https://github.com/scylladb/scylladb:
test: fix test_coordinator_queue_management flakiness
test/pylib: allow expected_error in server_start to contain regular expression
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.
Additionally, `test_rest_api_on_startup` is added to reproduce the problem.
Fixes: https://github.com/scylladb/scylladb/issues/27130
No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start.
Closesscylladb/scylladb#27410
* github.com:scylladb/scylladb:
test: add test_rest_api_on_startup
main: delay setup of storage_service REST API
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:
```c++
// [1, 2, 3, 0] --> [0, 1, 3, 6]
std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```
However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.
As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.
An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57
The fix is simple: the initial value is passed as uint64_t instead of int.
Fixes https://github.com/scylladb/scylladb/issues/27417Closesscylladb/scylladb#27418
Fixes#27362
The KMIP host connector should handle ipv4 connections (named or numeric).
It also should fall back to system trust when truststore is not specified.
Fixes#27366
A URL with numeric host part formats special in case of ipv6,
to avoid confusion with port part.
The parser should handle this.
I.e.
http://[2001:db8:4006:812::200e]:8080
v2:
* Include scheme agnostic parse + case insensitive scheme matching
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE
Fixes https://github.com/scylladb/scylladb/issues/27070Closesscylladb/scylladb#27097
* github.com:scylladb/scylladb:
mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
cql: add CONCURRENCY to the USING clause
Currently, _flush_time was stored as a std::optional<gc_clock::time_point>
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected since the _flush_time is initialized
with value (not optional).
Change _flush_time type to gc_clock::time_point. If a flush is needed but failed,
get_flush_time() throws an exception.
This was suppose to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.
Refs: https://github.com/scylladb/scylladb/issues/24415.
Closesscylladb/scylladb#26794
p11-kit has hardcoded paths for the trust paths. Of course, each
Linux distribution hardcodes those paths differently. As a result,
our relocatable gnutls, which uses p11-kit-trust.so to process the
trust paths, needs some overrides to select the right paths.
Currently, we use p11_kit_override_system_files(), a p11-kit API
intended for testing, but which worked well enough for our purpose,
to override the trust module configuration.
Unfortunately, starting (presumably [1]) in gnutls 3.8.11, gnutls
changed how it works with p11-kit and our override is now ignored.
This was likely unintentional, but there appears to be a better way:
instead of letting gnutls auto-load the trust module from a hacked
configuration, we load the modules outselves using
gnutls_pkcs11_init(GNUTLS_PKCS11_FLAG_MANUAL) and
gnutls_pkcs11_add_provider(). These appear to be intended for the purpose.
We communicate the paths to the scylla executable using an environment
variable. This isn't optimal, but is much easier than adding a command
line variable since there are multiple levels of command line parsing due
to the subtool mechanism.
With this, we unlock the possibility to upgrade gnutls to newer versions.
[1] aa5f15a872Closesscylladb/scylladb#27348
After 39cec4ae45 node join may fail with either "request canceled"
notification or (very rarely) because it was banned. Depend on timing.
The patch fixes the test to check for both possibilities.
Fixes#27242
Similar to AWS, google services may at times simply return a 503,
more or less meaning "busy, please retry". We rely for most cases
higher up layers to handle said retry, but we cannot fully do so,
because both we reach this code sometimes through paths that do
no such thing, and also because it would be slightly inefficient,
since we'd like to for example control the back-off for auth etc.
This simply changes the existing retry loop in gcp_host to
be a little more forgiving, special case 503 errors and extend
the retry to the auth part, as well as re-use the
exponential_backoff_retry primitive.
v2:
* Avoid backoff if refreshing credentials. Should not add latency due to this.
* Only allow re-auth once per (non-service-failure-backoff) try.
* Add abort source to both request and retry
v3:
* Include timeout and other server errors in retry-backoff
v4:
* Reorder error code handling correctly
Closesscylladb/scylladb#27267
This commit removes the now redundant driver pages from
the Scylla DB documentation. Instead, the link to the pages
where we moved the diver information is added.
Also, the links are updated across the ScyllaDB manual.
Redirections are added for all the removed pages.
Fixes https://github.com/scylladb/scylladb/issues/26871Closesscylladb/scylladb#27277
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
It's a pre existing issue. Backport is required to all recent 2025.x versions.
Closesscylladb/scylladb#27413
* github.com:scylladb/scylladb:
topology_coordinator: Fix the indentation for the cleanup_target case
topology_coordinator: Add barrier to cleanup_target
test_node_failure_during_tablet_migration: Increase RF from 2 to 3
Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files
and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with
the statistics component.
This patch adds separate group for vector search parameters in the
documentation and fixes small typos and formatting.
Fixes: SCYLLADB-77.
Closesscylladb/scylladb#27385
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.
This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.
This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.
Closesscylladb/scylladb#27388
Scylla uses zlib, through the header <zlib.h>, in sstable compression.
We also want to use it in Alternator for gzip-compressed requests.
We never actually required zlib explicltly in install-dependencies.sh,
we only get it through transitive dependencies. But it's better to
require it explicitly so this is what we do in this patch.
In Fedora, we use the newer, more efficient, zlib-ng which is API-
compatible with the classic zlib. Unfortunately, the Debian zlib-ng
package is *not* drop-in compatible with zlib (you need to include
a different header file <zlib-ng.h>) so we use the classic zlib.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27238
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.
To address it, the third rack with a single node is added and the
replication factor is increased to 3.
This test verifies that REST API requests are handled properly
when a server is started or restarted. It is used to verify
the fix for scylladb/scylladb#27130, where a server failed with a
segmentation fault when `storage_service/raft_topology/reload` was
called too early.
Refs: scylladb/scylladb#27130
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.
Fixes: scylladb/scylladb#27130
Creating endpoint conf can be made with the s3_server method
Getting boto3 resource from s3_server itself is also possible
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27380
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.
To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.
Fixesscylladb/scylladb#27398Closesscylladb/scylladb#27401
The metrics extension now includes validation to detect missing metrics. This validation caused failures during multiversion publication because older versions did not generate all required properties.
Instead of fixing each branch, a strict mode flag was introduced to control when validation should run.
Strict mode is enabled in the workflow that validates pull requests, ensuring that new changes meet the expected metrics.
During multiversion builds, validation errors are now logged but do not raise exceptions, which prevents build failures while still providing visibility into missing data.
docs: verbose mode
docs: verbose mode
Closesscylladb/scylladb#27402
Since Alternator is now using tablets by default, it's no longer possible
to create an Alternator table on a 3-node cluster with a single rack -
you need to have 3 racks to support RF=3.
Most of the multi-node Alternator tests in test/cluster/test_alternator.py
were already fixed to use a 3-rack cluster, but one test was missed
because it was marked "xfail" so its new failure to create the table was
missed. This patch adds the missing 3-rack setup, so the xfailing test
returns to failing on the real bug - not on the table creation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27382
Even though that `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.
Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneus view building coordinator behavior - does not print a warning.
Fixes: scylladb/scylladb#27363Closesscylladb/scylladb#27373
The motivation for the update is using newer version of scylla-driver
that supports new event type CLIENT_ROUTES_CHANGE.
* tools/cqlsh 22401228...6badc992 (2):
> Update scylla-driver version to 3.29.6
> Revert "Migrate workflows to Blacksmith"
Closesscylladb/scylladb#27359
Fixes#27268
Refs #27268
Includes the auth call in code covered by backoff-retry on server
error, as well as moves the code to use the shared primitive for this
and increase the resilience a bit (increase retry count).
v2:
* Don't do backoff if we need to refresh credentials.
* Use abort source for backoff if avail
v3:
* Include other retryable conditions in auth check
Closesscylladb/scylladb#27269
This reverts commit b0643f8959, reversing
changes made to e8b0f8faa9.
The change forgot to update
sstables_manager::get_highest_supported_format(), which results in
/system/highest_supported_sstable_version still returning me, confusing
and breaking tests.
Fixes: scylladb/scylla-dtest#6435Closesscylladb/scylladb#27379
The root cause for the hanging test is a concurrency deadlock.
`vector_store_client` runs dns refresh time and it is waiting for the condition
variable.After aborting dns request the test signals the condition variable.
Stopping the vector_store_client takes time enough to trigger the next dns
refresh - and this time the condition variable won't be signalled - so
vector_store_client will wait forever for finish dns refresh fiber.
The commit fixes the problem by waiting for the condition variable only once.
Fixes: #27237
Fixes: VECTOR-370
Closesscylladb/scylladb#27239
This PR introduces two key improvements to the robustness and resource management of vector search:
Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out
, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.
Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.
Fixes: SCYLLADB-76
This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.
Closesscylladb/scylladb#27377
* github.com:scylladb/scylladb:
vector_search: Fix ANN query abort on CQL timeout
vector_search: Reduce connection and keep-alive timeouts