- Remove manual gzip header parsing - libdeflate handles all format details
- Rename linearize_chunked_content to build_input_buffer and free chunks as we copy
- Add output chunking to split large decompressed data into 1MB chunks
- Add comment explaining libdeflate's whole-buffer requirement
- Use better initial size heuristic based on compression ratio
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Check if total_decompressed >= length_limit before allocating output buffer
- Prevents allocating a zero-sized buffer when limit is already reached
- Ensures clear error message when limit is exceeded
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Removed unused get_gzip_member_size function
- Rely on libdeflate_gzip_decompress to tell us how many input bytes were consumed
- Added check for zero bytes consumed to detect invalid state
- Simplified the logic by removing unnecessary header size tracking
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Created utils/gzip.hh header with ungzip function declaration
- Created utils/gzip.cc implementation using libdeflate
- Updated utils/CMakeLists.txt to include gzip.cc and link libdeflate
- Created comprehensive test suite in test/boost/gzip_test.cc
- Added gzip_test to test/boost/CMakeLists.txt
The implementation:
- Uses libdeflate for high-performance gzip decompression
- Handles chunked_content input/output (vector of temporary_buffer)
- Supports concatenated gzip files
- Validates gzip headers and detects invalid/truncated/corrupted data
- Enforces size limits to prevent memory exhaustion
- Runs in async context to avoid blocking the reactor
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
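As a rough illustration of the pieces above, here is a minimal synchronous sketch of the decompression loop, assuming plain std::vector buffers instead of chunked_content and omitting the seastar/async and output-chunking parts; the libdeflate calls are the real API, everything else is simplified.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <libdeflate.h>

std::vector<uint8_t> ungzip(const std::vector<uint8_t>& in, size_t length_limit) {
    // libdeflate has no streaming interface: it needs the whole input buffer
    // and a preallocated output buffer, hence the linearized input.
    libdeflate_decompressor* d = libdeflate_alloc_gzip_decompressor();
    if (!d) {
        throw std::bad_alloc();
    }
    std::vector<uint8_t> out;
    size_t pos = 0;
    try {
        while (pos < in.size()) {
            if (out.size() >= length_limit) {
                throw std::runtime_error("decompressed data exceeds length limit");
            }
            // Initial size guess; the real code uses a heuristic based on the
            // compression ratio observed so far.
            size_t avail = std::min(std::max<size_t>(4 * (in.size() - pos), 16 * 1024),
                                    length_limit - out.size());
            size_t old_size = out.size();
            out.resize(old_size + avail);
            size_t in_used = 0;
            size_t out_used = 0;
            libdeflate_result res = libdeflate_gzip_decompress_ex(
                d, in.data() + pos, in.size() - pos,
                out.data() + old_size, avail, &in_used, &out_used);
            if (res != LIBDEFLATE_SUCCESS || in_used == 0) {
                // Covers invalid, truncated and corrupted data; a full
                // implementation would also grow the buffer and retry on
                // LIBDEFLATE_INSUFFICIENT_SPACE.
                throw std::runtime_error("invalid gzip data");
            }
            out.resize(old_size + out_used);
            pos += in_used; // advance to the next concatenated gzip member
        }
    } catch (...) {
        libdeflate_free_decompressor(d);
        throw;
    }
    libdeflate_free_decompressor(d);
    return out;
}
```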
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes #27055
Closes scylladb/scylladb#27075
This patch adds the missing warning that returning the similarity
distance is not yet possible. Support for it will be added in the next
iteration.
Fixes #27086
It has to be backported to 2025.4, as this limitation applies to 2025.4.
Closes scylladb/scylladb#27096
Following 9b6ce030d0 ("sstables: remove quadratic (and possibly
exponential) compile time in parse()"), where we removed recursion
in reading, we do the same here for variadic write. This results
in a small reduction in compile time.
Note the problem isn't very bad here. This is tail-recursion, so likely
removed by the compiler during optimization, and we don't have additional
amplification due to future::then() double-compiling the ready-future
and unready-future paths. Still, better to avoid quadratic compile
times.
Closes scylladb/scylladb#27050
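For illustration, a toy sketch of the pattern with hypothetical names and plain stream-style writes rather than the actual sstables/seastar code: the recursive form instantiates one template per argument-pack suffix, while a C++17 fold expression keeps it linear.

```cpp
// Recursive variadic write: each call instantiates a template for the
// remaining suffix of the pack, so a pack of size n produces O(n) distinct
// instantiations per call site -- quadratic across nested expansions.
template <typename Stream>
void write_recursive(Stream&) {}

template <typename Stream, typename T, typename... Rest>
void write_recursive(Stream& s, const T& v, const Rest&... rest) {
    s << v;
    write_recursive(s, rest...); // tail call: cheap at run time, costly to compile
}

// Non-recursive form: a single C++17 fold expression, one instantiation
// per call site, linear compile time.
template <typename Stream, typename... Args>
void write_flat(Stream& s, const Args&... args) {
    (s << ... << args);
}
```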
This reverts commit 43738298be.
This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.
Closes scylladb/scylladb#27067
Fixes #27047
The updates include:
- adding missing parts like topology states and table rows,
- documenting zero-token nodes,
- replacing the old recovery procedure with the new one.
Fixes#26412
Updates of internal docs (usually read on master) don't require
backporting.
Closes scylladb/scylladb#27022
* github.com:scylladb/scylladb:
docs/dev/topology-over-raft: update the recovery section
docs/dev/topology-over-raft: document zero-token nodes
docs/dev/topology-over-raft: clarify the lack of tablet-specific states
docs/dev/topology-over-raft: add the missing join_group0 state
docs/dev/topology-over-raft: update the topology columns
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not repeat the work. We also change 'nodetool cleanup' back to running cleanup on one node only (resetting the 'cleanup needed' flag at the end), while the new '--global' option runs cleanup simultaneously on all nodes that need it.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported versions, since the current automatic cleanup behaviour may create load the operator does not expect during cluster resizing.
Closes scylladb/scylladb#26868
* https://github.com/scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
The executor::add_stream_options() obtains a local database reference from
the proxy just to get the feature service from it.
Similar chain is used in executor::update_time_to_live().
It's shorter to get features from proxy itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26973
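A before/after sketch of the shortening, with hypothetical stand-in types; the accessor names are assumptions, not the exact Scylla API.

```cpp
// Stand-in types for illustration only.
struct feature_service {};
struct database {
    feature_service fs;
    feature_service& features() { return fs; }
};
struct storage_proxy {
    database db;
    database& local_db() { return db; }
    feature_service& features() { return db.features(); } // direct accessor
};

void add_stream_options(storage_proxy& proxy) {
    // Before: reach through the local database just for the feature service.
    feature_service& via_db = proxy.local_db().features();
    // After: ask the proxy itself.
    feature_service& direct = proxy.features();
    (void)via_db; (void)direct;
}
```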
…played
Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk data resurrection.
Update _last_replay only if all batches were replayed.
Fixes: https://github.com/scylladb/scylladb/issues/24415.
Needs backport to all live versions.
Closes scylladb/scylladb#26793
* github.com:scylladb/scylladb:
test: extend test_batchlog_replay_failure_during_repair
db: batchlog_manager: update _last_replay only if all batches were replayed
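A minimal sketch of the fix, assuming a simplified synchronous batchlog_manager (the real code is future-based):

```cpp
#include <chrono>

using db_clock = std::chrono::system_clock;

class batchlog_manager {
    db_clock::time_point _last_replay{};

    // Placeholder for the real replay loop; returns false if any batch
    // could not be replayed in this pass.
    bool replay_all_batches() { return true; }

public:
    void do_batch_log_replay() {
        auto started = db_clock::now();
        bool all_replayed = replay_all_batches();
        // Previously _last_replay was updated unconditionally, so hint
        // flushing could assume batches were replayed when they weren't,
        // risking data resurrection.
        if (all_replayed) {
            _last_replay = started;
        }
    }
};
```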
This patch adds support for multiple audit log outputs.
If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.
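A minimal sketch of the composite, simplified to synchronous calls and a string record type (the real helpers are future-based):

```cpp
#include <memory>
#include <string>
#include <vector>

// storage_helper is the existing per-backend interface (e.g. table, syslog);
// the composite fans each audit record out to every configured output.
struct storage_helper {
    virtual ~storage_helper() = default;
    virtual void write(const std::string& record) = 0;
};

class audit_composite_storage_helper : public storage_helper {
    std::vector<std::unique_ptr<storage_helper>> _helpers;
public:
    explicit audit_composite_storage_helper(std::vector<std::unique_ptr<storage_helper>> helpers)
        : _helpers(std::move(helpers)) {}

    void write(const std::string& record) override {
        for (auto& h : _helpers) {
            h->write(record); // e.g. one table helper + one syslog helper
        }
    }
};
```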
Performance testing shows that read query throughput and auth request throughput remain consistent even at high reactor utilization, while read query latency increases slightly.
Read query ops = 60k/s
AUTH ops = 200/s
| Audit Mode   | QUERY latency (p99) | Δ% vs none |
|--------------|---------------------|------------|
| none         | 777                 | 0          |
| table        | 801                 | +3.09%     |
| syslog       | 803                 | +3.35%     |
| table,syslog | 818                 | +5.28%     |
Read query ops = 50k/s
AUTH ops = 200/s
| Audit Mode   | QUERY latency (p99) | Δ% vs none |
|--------------|---------------------|------------|
| none         | 643                 | 0          |
| table        | 647                 | +0.62%     |
| syslog       | 648                 | +0.78%     |
| table,syslog | 656                 | +2.02%     |
Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test)
Fixes #26022
Backport:
The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it.
Closes scylladb/scylladb#26613
* github.com:scylladb/scylladb:
test: dtest: audit_test.py: add AuditBackendComposite
test: dtest: audit_test.py: group logs in dict per audit mode
audit: write out to both table and syslog
audit: move storage helper creation from `audit::start` to `audit::audit`
audit: fix formatting in `audit::start_audit`
audit: unify `create_audit` and `start_audit`
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
resets the "cleanup needed" flag afterwards), and adds the "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
A Vector Store node is now considered down if it returns an HTTP 500
server error. This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is explicitly 'SERVING'.
Fixes: VECTOR-187
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closes scylladb/scylladb#26413
* github.com:scylladb/scylladb:
vector_search: Improve vector-store health checking
vector_search: Move response_content_to_sstring to utils.hh
vector_search: Add unit tests for client error handling
vector_search: Enable mocking of status requests
vector_search: Extract abort_source_timeout and repeat_until
vector_search: Move vs_mock_server to dedicated files
Some of the columns were added, but the doc wasn't updated.
`upgrade_state` was updated in only one of the two places.
`ignore_nodes` was changed to a static column.
This PR fixes staging sstables handling by the view building coordinator in case of intra-node tablet migration or tablet merge.
To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of the `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard.
The patch should be backported to 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26244
Closes scylladb/scylladb#26454
* github.com:scylladb/scylladb:
service/storage_service: migrate staging sstables in view building worker during intra-node migration
db/view/view_building_worker: support sstables intra-node migration
db/view_building_worker: fix indent
db/view/view_building_worker: don't organize staging sstables by last token
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
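A small sketch of these two rules, with hypothetical names and a simplified reply shape:

```cpp
#include <string_view>

enum class node_health { up, down };

// Any HTTP 5xx marks the node down; only an explicit SERVING status marks
// it up again. Everything else leaves the node out of the request pool.
node_health classify_status_reply(int http_status, std::string_view status_body) {
    if (http_status >= 500) {
        return node_health::down; // e.g. DB unreachable or initial scan unfinished
    }
    if (http_status == 200 && status_body == "SERVING") {
        return node_health::up;
    }
    return node_health::down;
}
```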
Move the response_content_to_sstring utility function from
vector_store_client.cc to utils.hh to enable reuse across
multiple files.
This refactoring prepares for the upcoming `client.cc` implementation
that will also need this functionality.
Introduce dedicated unit tests for the client class to verify existing
functionality and serve as regression tests.
These tests ensure that invalid client requests do not cause nodes to
be marked as down.
Extend the mock server to allow inspecting incoming status requests and
configuring their responses.
This enables client unit tests to simulate various server behaviors,
such as handling node failures and backoff logic.
The `abort_source_timeout` and `repeat_until` functions are moved to
the shared utility header `test/vector_search/utils.hh`.
This allows them to be reused by upcoming `client` unit tests, avoiding
code duplication.
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.
This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped.
To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.
Backport is not required; it is an improvement.
Refs #21776
Closes scylladb/scylladb#26444
* github.com:scylladb/scylladb:
boost/repair_test: add repair reader integrity verification test cases
test/lib: allow to disable compression in random_mutation_generator
sstables: Skip checksum and digest reads for unlinked SSTables
table: enable integrity checks for streaming reader
table: Add integrity option to table::make_sstable_reader()
sstables: Add integrity option to create_single_key_sstable_reader
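A sketch of the full-sstable digest check, using zlib's crc32 as a stand-in for the sstable Digest component (the names and the exception type are illustrative):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string_view>
#include <zlib.h>

// The reader feeds every data chunk through update(); finish() compares the
// computed digest against the one read from disk and throws (the real code
// throws malformed_sstable_exception) on mismatch. As the text explains,
// this is only done when the read range covers the whole sstable.
class digest_verifier {
    uLong _crc = crc32(0L, nullptr, 0);
public:
    void update(std::string_view chunk) {
        _crc = crc32(_crc,
                     reinterpret_cast<const unsigned char*>(chunk.data()),
                     static_cast<uInt>(chunk.size()));
    }
    void finish(uint32_t digest_on_disk) const {
        if (_crc != digest_on_disk) {
            throw std::runtime_error("sstable digest mismatch: data is corrupted");
        }
    }
};
```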
This patch fixes two issues in one go.
First, sstables::load currently clears the sharding metadata
(via open_data()), so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid JSON,
as they are currently printed as binary bytes; they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
Fixes #26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#26991
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which would use
storage_service
- but delay storage_service shutdown until all *existing*
executions are done
Fixes scylladb/scylladb#26734
Backport: no need. The problem has existed for a very long time; the code restructuring in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hit more often, as _ss was called earlier, but that restructuring hasn't been released yet.
Closes scylladb/scylladb#26779
* github.com:scylladb/scylladb:
service: attach storage_service to migration_manager using pluggable
service: migration_manager: coroutinize merge_schema_from
service: migration_manager: coroutinize reload_schema
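A minimal sketch of the permit approach using seastar::gate (the wiring names are illustrative):

```cpp
#include <functional>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

// migration_manager enters the gate before any code path that touches
// storage_service; storage_service::stop() closes the gate, which makes
// new entries throw gate_closed_exception (so *new* work is skipped)
// while waiting for all *existing* permit holders to finish.
seastar::gate _ss_permits;

seastar::future<> run_with_ss_permit(std::function<seastar::future<>()> work) {
    // Throws seastar::gate_closed_exception once storage_service is stopping.
    return seastar::with_gate(_ss_permits, std::move(work));
}

seastar::future<> storage_service_stop() {
    return _ss_permits.close(); // resolves when all permits are returned
}
```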
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixes https://github.com/scylladb/scylladb/issues/26976
backport to previous versions since it fixes a bug in MV with vnodes
Closes scylladb/scylladb#27008
* github.com:scylladb/scylladb:
test: add mv write during node join test
topology_coordinator: include joining node in barrier
This PR refactors excluded nodes handling for tablets and topology. For tablets, a dedicated variable `topology::excluded_tablet_nodes` is introduced; for topology operations, the method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.
The PR improves code readability and efficiency; there are no behavior changes.
backport: this is a refactoring/optimization, no need to backport
Closes scylladb/scylladb#26907
* https://github.com/scylladb/scylladb:
topology_coordinator: drop unused exec_global_command overload
topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
topology_state_machine: inline get_excluded_nodes
messaging_service: simplify and optimize ban_host
storage_service: topology_state_load: extract topology variable
topology_coordinator: excluded_tablet_nodes -> ignored_nodes
topology_state_machine: add excluded_tablet_nodes field
vector_search: Add backoff for failed nodes
Introduces logic to mark nodes that fail to answer an ANN request as
"down". Down nodes are omitted from further requests until they
successfully respond to a health check.
Health checks for down nodes are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
Client list management is moved to separate files (clients.cc/clients.hh)
to improve code organization and modularity.
References: VECTOR-187.
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closes scylladb/scylladb#26308
* github.com:scylladb/scylladb:
vector_search: Set max backoff delay to 2x read request timeout
vector_search: Report status check exception via on_internal_error_noexcept
vector_search: Extract client management into dedicated class
vector_search: Add backoff for failed clients
vector_search: Make endpoint available
vector_search: Use std::expected for low-level client errors
vector_search: Extract client class
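A sketch of the retry policy described above: exponential backoff from 100ms up to a cap, reset once a health check succeeds. The 20s default cap mirrors the text; a later commit derives it from 2x the read request timeout.

```cpp
#include <algorithm>
#include <chrono>

using namespace std::chrono_literals;

class backoff_policy {
    std::chrono::milliseconds _delay = 100ms;
    std::chrono::milliseconds _max;
public:
    explicit backoff_policy(std::chrono::milliseconds max = 20s) : _max(max) {}

    // Delay to wait before the next background health check of a down node.
    std::chrono::milliseconds next() {
        auto d = _delay;
        _delay = std::min(_delay * 2, _max);
        return d;
    }

    // Called when the node answers the status endpoint again.
    void reset() { _delay = 100ms; }
};
```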
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
Closes scylladb/scylladb#26856
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
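A sketch of the new branches, with hypothetical names and a plain optional-string user:

```cpp
#include <optional>
#include <string>

// Skip the auth::service-backed lookup when there is no authenticated user
// or the user is the anonymous role -- as is the case for the maintenance
// socket and maintenance mode -- and fall back to the default service level
// instead of failing an assertion.
std::optional<std::string> service_level_for(
        const std::optional<std::string>& user, bool auth_service_registered) {
    if (!user || *user == "anonymous") {
        return std::nullopt; // default service level, auth::service untouched
    }
    if (!auth_service_registered) {
        return std::nullopt; // auth not ready yet; don't assert
    }
    // ... normal path: query auth::service for the role's service level ...
    return std::nullopt;
}
```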
Remove bootstrap and decommission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for bootstrap and decommission is safe, and faster
than RBNO in all conditions, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: #24664
Closes scylladb/scylladb#26330
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test.
Fixes scylladb/scylladb#26686
Backport: this impacts users, so we should backport it to 2025.4.
Closes scylladb/scylladb#26729
* github.com:scylladb/scylladb:
test/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
db/view/view_building_coordinator: Rate limit logging failed RPC
db/view: Add backoff when RPC fails
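A sketch of the shape of the backoff (illustrative; the real code lives in the view building coordinator/worker):

```cpp
#include <chrono>
#include <exception>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>

// If processing the coordinator's RPC fails, postpone finishing the task
// by a second before propagating the error, so the coordinator doesn't
// immediately re-issue a request that is likely to fail again and flood
// the logs with warnings.
seastar::future<> finish_task_after_failure(std::exception_ptr ep) {
    co_await seastar::sleep(std::chrono::seconds(1));
    std::rethrow_exception(ep);
}
```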
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixes https://github.com/scylladb/scylladb/issues/26340
The issue affects all previous releases; backport to improve stability.
Closes scylladb/scylladb#26533
* github.com:scylladb/scylladb:
test: test concurrent writes with column drop with cdc preimage
cdc: check if recreating a column too soon
cdc: set column drop timestamp in the future
migration_manager: pass timestamp to pre_create
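A sketch of the core idea, with an illustrative margin (the actual offset the patch uses may differ):

```cpp
#include <chrono>
#include <cstdint>

// Scylla timestamps are microseconds since the epoch (api::timestamp_type).
using api_timestamp = int64_t;

// When dropping a column from a CDC log table, place the drop timestamp a
// few seconds in the future, so concurrent writes -- whose timestamps are
// "now" -- stay below it and flushed sstables remain valid.
api_timestamp cdc_column_drop_timestamp() {
    auto now_us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    constexpr int64_t margin_us = 5'000'000; // illustrative 5s margin
    return now_us + margin_us;
}
```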
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which would use
storage_service
- but delay storage_service shutdown until all *existing*
executions are done
Fixes scylladb/scylladb#26734
Since the error is not printed to stdout, when working with multiple
files we don't know which sstable the error is associated with.
Closes scylladb/scylladb#27009
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
This exception should only occur due to internal errors, not client or external issues.
If it is triggered, it indicates an internal problem, so we report it
using on_internal_error_noexcept.