~2076 files used "Copyright (C) YYYY-present ScyllaDB" while
~88 files used "Copyright (C) YYYY ScyllaDB". This
inconsistency leads to unnecessary code review discussions
and gradual spread of the less common format.
Standardize all ScyllaDB copyright headers to use -present.
Fixes SCYLLADB-1984
Closesscylladb/scylladb#29876
Changed seastar::http::experimental to seastar::http to reflect
graduation of the seastar http API.
Changed call to seastar::rename_file() (in sstables/storage.cc,
sstables/sstable_directory.cc, sstable/sstables.cc and
db/hints/internal/hint_storage.cc) to reflect new default parameter.
Updated scylla_gdb test helper get_task() to work with updated
accept loop in Seatar. This is just test code (attempts to find
a task to operate on), not used in real scylla-gdb.py work, but
nevertheless the adjustment keeps backward compatibility.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043
* seastar 485a62b2...510f3148 (43):
> reactor_backend: fix iocb double-free and shutdown hang during AIO teardown
> file: fix default DMA alignment
> http: add to_reply() to redirect_exception with extra-header support
> core: propagate syscall errors via `coroutine::exception`
> file: assert dma alignments are powers of two
> doc: Document undocumented io_tester features and fix output example
> backtrace: print the build_id along with the backtrace
> reactor: default to oneline backtraces
> Merge 'json: formatter: support types with user-defined conversion to sstring' from Benny Halevy
tests: json_formatter: test formatter::write with string types
json: formatter: support types with user-defined conversion to sstring
> httpd_test: fix build failure with Seastar_SSTRING=OFF
> net/tls: introduce ssl_call wrapper for SSL I/O
> build: disable unused command line argument error for C++ module
> coroutine/generator: fix setup of generator's waiting task
> tests/tls: set 1000-day validity for self-signed CA cert
> net: tls: openssl: disable certificate compression
> reactor: reduce steady_clock::now() calls per scheduling quantum
> fair_queue: remove notify_request_finished()
> loop: use small_vector for parallel_for_each_state incomplete futures
> dodge false sharing in spinlock
> Merge 'Handle nowait support for reads and writes independently' from Pavel Emelyanov
file: Change nowait_works mode detection
file: Introduce read-only nowait_mode
filesystem: Make nowait_works bit a enum class too
file: Make nowait_works bit a enum class
> Merge 'net/tls: improve OpenSSL error queue hygiene' from Gellért Peresztegi-Nagy
net/tls: assert clean error queue before SSL operations
net/tls: clear error queue after successful SSL operations
net/tls: clear error queue after successful SSL_CTX_new
net/tls: drain error queue on unexpected error codes
net/tls: use make_openssl_error for BIO creation failure
> vla.hh: add missing includes
> Merge 'smp: make smp::count non-static' from Avi Kivity
smp: convert all smp::count usages to instance-aware alternatives
smp: add per-instance shard_count and this_smp() infrastructure
disk_params: document pre-init smp::count access with explicit 0
reactor_backend: document pre-init smp::count access with explicit 0
tests: alien_test: pass shard count to alien thread explicitly
> build: fix cmake missing ninja on Ubuntu 26.04
> rpc: Fix uint64 wraparound of expired timeout in send_entry()
> Merge 'Generalize some RPC tests' from Pavel Emelyanov
tests: Generalize async connection-based scheduling RPC tests
tests: Generalize sync connection-based scheduling RPC tests
tests: Remove redundant variadic/nonvariadic RPC tuple tests
tests: Generalize max timeout RPC tests
> net: tls: openssl: Share BIO ptrs across shards
> http: fix compilation on clang 22 with c++26
> build: openssl tools needed for test cert generation
> reactor: support rename2
> future: fix forwarding of reference types
> Merge 'Zero-copy http chunked data sink' from Pavel Emelyanov
http: Make chunked data sink zero-copy
tests/prometheus_http: Rewrite on top of http::client
tests/httpd: Rewrite content_length_limit on top of http::client
> tests: Replace ad-hoc http_consumer with production HTTP parser
> Merge 'co_return to accept same expressions and types as return' from Alexey Bashtanov
tests/unit/{coroutines,futures}: strict types on co_return and set_value
api: introduce version 10:
core/{coroutine,future}: make `co_return` more strict with types
core/{coroutine,future}: preparations to fix `co_return` type semantics
> Merge 'Perftune.py: add special handling for mlx5 rss queues number calculation' from Vladislav Zolotarov
perftune.py: NetPerfTuner: enhance RSS (a.k.a. "Rx") queues accounting for mlx5 devices
perftune.py: update docstring of NetPerfTuner.__get_rps_cpus() method
perftune.py: add a method that parses and models the output of the 'ethtool -l' command for a given interface
> httpd: rewrite do_accepts/do_accept_one as coroutines
> file: add mmap support to file
> http: Move client code out of experimental namespace
> file: add hugetlbfs support to file system detection
> tests: Replace test_source_impl with util::as_input_stream
> tests: Replace buf_source_impl with util::as_input_stream
> Merge 'rpc_tester: expose throuput for rpc tester' from Marcin Szopa
rpc_tester: remove unused payload size variable from job_rpc_streaming class
rpc_tester: add start time tracking for throughput calculation, print throughput and msg/s for job_rpc
rpc_tester: refactor result emission to use dedicated functions for messages and throughput
> iostream: cast first argument of `std::min` to `size_t`
Closesscylladb/scylladb#29952
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.
This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:
- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)
The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.
The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated
Refs: SCYLLADB-1087
Backport: new feature, no backport
Closesscylladb/scylladb#29324
* github.com:scylladb/scylladb:
sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
sstables: introduce --abort-on-malformed-sstable-error infrastructure
sstables: refactor parse_path() to return std::expected<> instead of throwing
s3_storage::make_source previously ignored its file f parameter and
constructed a fresh s3::client::readable_file per call. The new
file's _stats cache was empty, so the first dma_read_bulk issued a
HEAD via maybe_update_stats just to learn the object size before
the ranged GET -- one ~50 ms RTT per uncached read.
The file f passed in by the two callers (sstable::data_stream for
Data.db reads and index_reader::make_context for Index.db reads)
already wraps the sstable's _data_file or _index_file. Those file
objects had their stats populated at sstable open time by
update_info_for_opened_data, and they were wrapped with the
configured file_io_extensions when opened via open_component. Reusing
them is exactly what filesystem_storage::make_source does (one-line
make_file_data_source over f), so the s3 path simply matches it.
readable_file::size() is also updated to route through
maybe_update_stats(), so a .size() call populates the _stats cache
the same way .stat() does -- preventing a redundant HEAD on the
first subsequent read of components opened with .size() (Index,
Partitions, Rows in update_info_for_opened_data).
Closesscylladb/scylladb#29766
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.
This leaves a window between API success and physical deletion where a
follow-up scan of the upload directory can detected sstables that will be deleted soon.
This might cause failure because SSTable will be already wiped during processing.
For fix:
Force unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlink eagerly at the end of
`stream_sstable_mutations`.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647
Backport is required, as it is a bug fix that was introduced in 517a4dc4df.
Closesscylladb/scylladb#29599
* github.com:scylladb/scylladb:
sstables_loader: synchronously unlink streamed sstables before returning
sstables: make sstable::unlink() idempotent
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomesv PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this, keyspace over object storage is experimental feature
Closesscylladb/scylladb#29659
* github.com:scylladb/scylladb:
db, sstables: add node_owner to sstables registry primary key
db, sstables: rename sstables registry column owner to table_id
Replace the two remaining direct 'throw bufsize_mismatch_exception(...)'
call sites with the new throw_bufsize_mismatch_exception() helper, which
routes through throw_malformed_sstable_exception() and thus also respects
the --abort-on-malformed-sstable-error flag.
Affected files:
- sstables/sstables.cc (1 site, in check_buf_size())
- sstables/m_format_read_helpers.cc (1 site, in check_buf_size())
Replace all direct 'throw malformed_sstable_exception(...)' call sites
with the new throw_malformed_sstable_exception() helper, which respects
the --abort-on-malformed-sstable-error flag.
Both functions now check abort_on_malformed_sstable_error() first. If
set, they log the error and call std::abort() directly, generating a
coredump. Otherwise they fall through to the existing on_internal_error()
path, which is in turn controlled by --abort-on-internal-error.
Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after
seastar::testing::scoped_no_abort_on_internal_error) and use it in all
tests that intentionally corrupt sstables and expect
malformed_sstable_exception to be thrown rather than the process aborting.
Add the --abort-on-malformed-sstable-error command-line option and the
supporting infrastructure. When set, any malformed sstable error will
abort the process and generate a coredump instead of throwing an
exception. This is useful for debugging memory corruption that may
manifest as apparent sstable corruption.
The implementation introduces:
- throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception()
helper functions in sstables/sstables.cc, which check the new flag and
either abort (with logging) or throw the appropriate exception.
- set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error()
to control the per-process atomic flag.
- abort_on_malformed_sstable_error config option (LiveUpdate, default false)
wired up in main.cc alongside abort_on_internal_error.
Call-site migration will follow in subsequent commits.
make_entry_descriptor() and the two overloads of parse_path() used to signal
parse failures by throwing malformed_sstable_exception, which made parse_path()
expensive to use as a probe (e.g. to classify directory entries).
Change make_entry_descriptor() and both parse_path() overloads to return
std::expected<T, sstring>, where the sstring carries the error message on
failure, eliminating the exception overhead at probe call sites.
Call sites that previously caught malformed_sstable_exception to treat the
path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc,
tools/scylla-sstable.cc) now check the expected result directly.
Call sites where a parse failure is a genuine error (sstable_directory.cc,
sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw
explicitly as malformed_sstable_exception using the error string, preserving
the existing error propagation behaviour.
Fix three format string bugs:
- partition_reversing_data_source.cc: _row_start was passed as an
argument but had no {} placeholder in the invariant error message.
Add {} for all three values to show the full diagnostic.
- reader.cc: two "Invalid boundary type" error messages passed the
type value as an argument but had no {} placeholder, so the actual
invalid type was never shown.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.
The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
Fixes SCYLLADB-1885
Closesscylladb/scylladb#29798
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.
The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.
Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697Closesscylladb/scylladb#29620
mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.
Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; for vnodes and fully-contained sstables are streamed
only once and could be removed just after being streamed.
Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to
make the never-fails contract explicit
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.
The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.
This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.
This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.
Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".
Fixes: SCYLLADB-1329
Closesscylladb/scylladb#29285
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.
The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it.
This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901
The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions.
Closesscylladb/scylladb#28873
* github.com:scylladb/scylladb:
streaming: reject mutation fragments on critical disk utilization
test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
sstables: clean up TemporaryHashes file in wipe()
sstables: add error injection point in write_components
test/cluster/storage: extract validate_data_existence to module scope
test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
utils/disk_space_monitor: add error injection to suppress threshold checks
The PR serves two purposes.
First, it makes the flag usage be consistent across multiple ways to load sstables components. For example, the sstable::load_metadata() doesn't set it (like .load() does) thus potentially refusing to load "corrupted" components, as the flag assumes.
Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This thing is called pretty much everywhere to initialize the sstable_open_config, while the option in question is "scylla state" parameter, not "sstable opening" one.
Code cleanup, not backporting
Closesscylladb/scylladb#29513
* github.com:scylladb/scylladb:
sstables: Remove ignore_component_digest_mismatch from sstable_open_config
sstables: Move ignore_component_digest_mismatch initialization to constructor
sstables: Add ignore_component_digest_mismatch to sstables_manager config
The stream_mutation_fragments RPC handler did not check
is_in_critical_disk_utilization_mode before accepting incoming mutation
fragments. This meant load-and-stream (nodetool refresh --load-and-stream)
could push data onto a node at critical disk utilization, potentially
filling the disk completely.
Add a critical disk utilization check in the get_next_mutation_fragment
lambda, throwing critical_disk_utilization_exception when the node is in
critical mode. This mirrors the existing protection in stream_blob.cc.
Also remove the xfail marker from the corresponding test added in the
previous commit.
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Initialize the ignore_component_digest_mismatch flag from sstables_manager::config
in the sstable constructor initializer list instead of in load(). This ensures the
flag value is set at construction time when the manager config is available, rather
than at load time. Mark the member const to reflect its immutability after construction.
Fixes the bootstrap path which now correctly reads the flag from manager config
initialized from db::config at boot time, instead of using the default value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The TemporaryHashes.db.tmp file is created during SSTable writing to
store intermediate bloom filter hashes and is deleted before the SSTable
is sealed. Since it is not tracked in the TOC, it is also absent from
_recognized_components and all_components().
When an SSTable write fails before sealing (e.g. streaming rejected due
to critical disk utilization), wipe() is called to clean up the partial
SSTable. However, wipe() only iterates over all_components(), so the
TemporaryHashes file was left behind as an orphan.
Previously, the only cleanup mechanism for this file was the
startup-time directory scanner in sstable_directory, which would not
help when the orphan needs to be cleaned up at runtime.
Explicitly remove the TemporaryHashes file in wipe(), ignoring ENOENT
for the common case where the file was already removed before sealing.
Add a `write_components_writer_created` error injection point in
`sstable::write_components()` between writer creation and fragment
consumption.
This injection is needed by the out-of-space streaming test (added in
the next patch) to reliably pause SSTable writing at the right moment:
after the SSTable writer has been created and files exist on disk, but
before mutation fragments are consumed.
Pausing earlier (before writer creation) would not work because there
are no files on disk yet, while pausing later (after consuming fragments)
would be too late to reliably push the node into critical disk utilization.
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records. On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.
Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count
A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
Change maybe_record_large_cells to return above_threshold_result with
separate booleans for cell size (.size) and collection elements
(.elements) thresholds. This allows the writer to track above_threshold
counts for cell_size and elements_in_collection independently.
Rename partition_above_threshold to above_threshold_result and its
'rows' field to 'elements', making it a generic struct that can be
reused for other large data types (e.g., cells with collection
elements).
Use designated initializers for clarity.
Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records. Each record carries:
- large_data_type (partition_size, row_size, cell_size, etc.)
- binary serialized partition key and clustering key
- column name (for cell records)
- value (size in bytes)
- element count (rows or collection elements, type-dependent)
- range tombstones and dead rows (partition records only)
The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.
Add JSON support in scylla-sstable and format documentation.
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:
Sync (runs in the observer, never suspends):
- S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
refresh flag.
- GCS: install a freshly constructed client; stash the old one for
async cleanup.
- storage_manager: update _object_storage_endpoints and fire the
async cleanup via a gate-guarded background fiber.
Async (gate-guarded background fiber):
- S3: acquire _creds_sem, invalidate and rearm credentials only if
the refresh flag is set.
- GCS: drain and close stashed old clients.
This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction.
The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed).
The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path:
- GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body.
- S3 `copy_object` silently ignored the abort_source parameter.
1. **gcp_client: fix copy_object request method and body** — Fix two bugs in the GCS rewrite API call.
2. **s3_client: pass through abort_source in copy_object** — Stop ignoring the abort_source parameter.
3. **object_storage: add copy_object to object_storage_client** — New interface method with S3 and GCS implementations.
4. **storage: add make_object_name overload with generation** — Helper for building destination object names with a different generation.
5. **storage: make delete_object const** — Needed by the const clone method.
6. **storage: implement object_storage_base::clone** — The actual clone implementation plus a copy_object wrapper.
7. **test/boost: enable sstable clone tests for S3 and GCS** — Re-enable the previously skipped tests.
A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both S3 and GCS.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045
Prerequisite: https://github.com/scylladb/scylladb/pull/28790
No need to backport since this code targets future feature
Closesscylladb/scylladb#29166
* github.com:scylladb/scylladb:
compaction_test: enable sstable clone tests for S3 and GCS
storage: implement object_storage_base::clone
storage: make delete_object const in object_storage_base
storage: add make_object_name overload with generation
sstables: add get_format() accessor to sstable
object_storage: add copy_object to object_storage_client
s3_client: pass through abort_source in copy_object
gcp_client: fix copy_object request method and body
Implement the clone method for object_storage_base, which creates
a copy of an sstable with a new generation using server-side object
copies. Also add a const copy_object convenience wrapper, similar
to the existing put_object and delete_object wrappers.
A dedicated test for the new object storage clone path will be
added in the following commit. The preexisting local-filesystem
clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable
test.
Add a make_object_name overload that accepts a target
generation parameter for constructing object names with
a generation different from the source sstable's own.
Refactor the original make_object_name to delegate to
the new overload, eliminating code duplication.
This is needed by clone to build destination object
names for the new generation.
Add a public get_format() accessor for the _format member, following
the same pattern as the existing get_version(). This allows storage
implementations to access the sstable format without reaching into
private members, and is needed by the upcoming object_storage_base::clone
to construct entry_descriptor for the sstables registry.
Add a copy_object method to the object_storage_client
interface for server-side object copies, with
implementations for both S3 and GCS wrappers.
The S3 wrapper delegates to s3::client::copy_object.
The GCS wrapper delegates to gcp::storage::client's
cross-bucket copy_object overload.
This is a prerequisite for implementing sstable clone
on object storage.
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.
The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.
* No functional change and no backport needed
Closesscylladb/scylladb#29209
* github.com:scylladb/scylladb:
test: add test_sstable_clone_preserves_staging_state
test: derive sstable state from directory in test_env::make_sstable
sstables: log debug message in filesystem_storage::clone
Add an `exists` method to the storage abstraction to allow S3, GCS,
and local storage implementations to check whether an sstable
component is present.
Switch _promoted_indexes storage in partition_index_page from
managed_vector to chunked_managed_vector to avoid large contiguous
allocations.
Avoid allocation failure (or crashes with --abort-on-internal-error)
when large partitions have enough promoted index entries to trigger a
large allocation with managed_vector.
Fixes: SCYLLADB-1315
Closesscylladb/scylladb#29283
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.
The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.
Fixes#29060
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29061
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access
Fixes#29058
The code appeared in 2026.1, worth fixing it there as well
Closesscylladb/scylladb#29059
* github.com:scylladb/scylladb:
sstables: Fix object storage lister not resetting position in batch vector
sstables: Fix object storage lister skipping entries when filter is active
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.
New functionality. No backport needed.
This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.
Closesscylladb/scylladb#28960
* github.com:scylladb/scylladb:
db/config: announce ms format as highest supported
db/config: enable `ms` sstable format by default
cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
api/system: add /system/chosen_sstable_version
test/cluster/dtest: reduce num_tokens to 16