mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.
Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; for vnodes and fully-contained sstables are streamed
only once and could be removed just after being streamed.
Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to
make the never-fails contract explicit
(cherry picked from commit 784127c40b)
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
(cherry picked from commit 7cdf215999)
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.
The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
Fixes SCYLLADB-1885
Closesscylladb/scylladb#29798
(cherry picked from commit d0813769ec)
Closesscylladb/scylladb#29810
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.
The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.
Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.
Fixes: SCYLLADB-1767
Closesscylladb/scylladb#29620
(cherry picked from commit 7e14ea5ac8)
Closesscylladb/scylladb#29692
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.
This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.
Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".
Fixes: SCYLLADB-1329
Closesscylladb/scylladb#29285
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.
The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it.
This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901
The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions.
Closesscylladb/scylladb#28873
* github.com:scylladb/scylladb:
streaming: reject mutation fragments on critical disk utilization
test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
sstables: clean up TemporaryHashes file in wipe()
sstables: add error injection point in write_components
test/cluster/storage: extract validate_data_existence to module scope
test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
utils/disk_space_monitor: add error injection to suppress threshold checks
The PR serves two purposes.
First, it makes the flag usage be consistent across multiple ways to load sstables components. For example, the sstable::load_metadata() doesn't set it (like .load() does) thus potentially refusing to load "corrupted" components, as the flag assumes.
Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This thing is called pretty much everywhere to initialize the sstable_open_config, while the option in question is "scylla state" parameter, not "sstable opening" one.
Code cleanup, not backporting
Closesscylladb/scylladb#29513
* github.com:scylladb/scylladb:
sstables: Remove ignore_component_digest_mismatch from sstable_open_config
sstables: Move ignore_component_digest_mismatch initialization to constructor
sstables: Add ignore_component_digest_mismatch to sstables_manager config
The stream_mutation_fragments RPC handler did not check
is_in_critical_disk_utilization_mode before accepting incoming mutation
fragments. This meant load-and-stream (nodetool refresh --load-and-stream)
could push data onto a node at critical disk utilization, potentially
filling the disk completely.
Add a critical disk utilization check in the get_next_mutation_fragment
lambda, throwing critical_disk_utilization_exception when the node is in
critical mode. This mirrors the existing protection in stream_blob.cc.
Also remove the xfail marker from the corresponding test added in the
previous commit.
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Initialize the ignore_component_digest_mismatch flag from sstables_manager::config
in the sstable constructor initializer list instead of in load(). This ensures the
flag value is set at construction time when the manager config is available, rather
than at load time. Mark the member const to reflect its immutability after construction.
Fixes the bootstrap path which now correctly reads the flag from manager config
initialized from db::config at boot time, instead of using the default value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The TemporaryHashes.db.tmp file is created during SSTable writing to
store intermediate bloom filter hashes and is deleted before the SSTable
is sealed. Since it is not tracked in the TOC, it is also absent from
_recognized_components and all_components().
When an SSTable write fails before sealing (e.g. streaming rejected due
to critical disk utilization), wipe() is called to clean up the partial
SSTable. However, wipe() only iterates over all_components(), so the
TemporaryHashes file was left behind as an orphan.
Previously, the only cleanup mechanism for this file was the
startup-time directory scanner in sstable_directory, which would not
help when the orphan needs to be cleaned up at runtime.
Explicitly remove the TemporaryHashes file in wipe(), ignoring ENOENT
for the common case where the file was already removed before sealing.
Add a `write_components_writer_created` error injection point in
`sstable::write_components()` between writer creation and fragment
consumption.
This injection is needed by the out-of-space streaming test (added in
the next patch) to reliably pause SSTable writing at the right moment:
after the SSTable writer has been created and files exist on disk, but
before mutation fragments are consumed.
Pausing earlier (before writer creation) would not work because there
are no files on disk yet, while pausing later (after consuming fragments)
would be too late to reliably push the node into critical disk utilization.
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records. On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.
Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count
A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
Change maybe_record_large_cells to return above_threshold_result with
separate booleans for cell size (.size) and collection elements
(.elements) thresholds. This allows the writer to track above_threshold
counts for cell_size and elements_in_collection independently.
Rename partition_above_threshold to above_threshold_result and its
'rows' field to 'elements', making it a generic struct that can be
reused for other large data types (e.g., cells with collection
elements).
Use designated initializers for clarity.
Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records. Each record carries:
- large_data_type (partition_size, row_size, cell_size, etc.)
- binary serialized partition key and clustering key
- column name (for cell records)
- value (size in bytes)
- element count (rows or collection elements, type-dependent)
- range tombstones and dead rows (partition records only)
The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.
Add JSON support in scylla-sstable and format documentation.
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:
Sync (runs in the observer, never suspends):
- S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
refresh flag.
- GCS: install a freshly constructed client; stash the old one for
async cleanup.
- storage_manager: update _object_storage_endpoints and fire the
async cleanup via a gate-guarded background fiber.
Async (gate-guarded background fiber):
- S3: acquire _creds_sem, invalidate and rearm credentials only if
the refresh flag is set.
- GCS: drain and close stashed old clients.
This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction.
The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed).
The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path:
- GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body.
- S3 `copy_object` silently ignored the abort_source parameter.
1. **gcp_client: fix copy_object request method and body** — Fix two bugs in the GCS rewrite API call.
2. **s3_client: pass through abort_source in copy_object** — Stop ignoring the abort_source parameter.
3. **object_storage: add copy_object to object_storage_client** — New interface method with S3 and GCS implementations.
4. **storage: add make_object_name overload with generation** — Helper for building destination object names with a different generation.
5. **storage: make delete_object const** — Needed by the const clone method.
6. **storage: implement object_storage_base::clone** — The actual clone implementation plus a copy_object wrapper.
7. **test/boost: enable sstable clone tests for S3 and GCS** — Re-enable the previously skipped tests.
A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both S3 and GCS.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045
Prerequisite: https://github.com/scylladb/scylladb/pull/28790
No need to backport since this code targets future feature
Closesscylladb/scylladb#29166
* github.com:scylladb/scylladb:
compaction_test: enable sstable clone tests for S3 and GCS
storage: implement object_storage_base::clone
storage: make delete_object const in object_storage_base
storage: add make_object_name overload with generation
sstables: add get_format() accessor to sstable
object_storage: add copy_object to object_storage_client
s3_client: pass through abort_source in copy_object
gcp_client: fix copy_object request method and body
Implement the clone method for object_storage_base, which creates
a copy of an sstable with a new generation using server-side object
copies. Also add a const copy_object convenience wrapper, similar
to the existing put_object and delete_object wrappers.
A dedicated test for the new object storage clone path will be
added in the following commit. The preexisting local-filesystem
clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable
test.
Add a make_object_name overload that accepts a target
generation parameter for constructing object names with
a generation different from the source sstable's own.
Refactor the original make_object_name to delegate to
the new overload, eliminating code duplication.
This is needed by clone to build destination object
names for the new generation.
Add a public get_format() accessor for the _format member, following
the same pattern as the existing get_version(). This allows storage
implementations to access the sstable format without reaching into
private members, and is needed by the upcoming object_storage_base::clone
to construct entry_descriptor for the sstables registry.
Add a copy_object method to the object_storage_client
interface for server-side object copies, with
implementations for both S3 and GCS wrappers.
The S3 wrapper delegates to s3::client::copy_object.
The GCS wrapper delegates to gcp::storage::client's
cross-bucket copy_object overload.
This is a prerequisite for implementing sstable clone
on object storage.
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.
The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.
* No functional change and no backport needed
Closesscylladb/scylladb#29209
* github.com:scylladb/scylladb:
test: add test_sstable_clone_preserves_staging_state
test: derive sstable state from directory in test_env::make_sstable
sstables: log debug message in filesystem_storage::clone
Add an `exists` method to the storage abstraction to allow S3, GCS,
and local storage implementations to check whether an sstable
component is present.
Switch _promoted_indexes storage in partition_index_page from
managed_vector to chunked_managed_vector to avoid large contiguous
allocations.
Avoid allocation failure (or crashes with --abort-on-internal-error)
when large partitions have enough promoted index entries to trigger a
large allocation with managed_vector.
Fixes: SCYLLADB-1315
Closesscylladb/scylladb#29283
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.
The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.
Fixes#29060
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29061
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access
Fixes#29058
The code appeared in 2026.1, worth fixing it there as well
Closesscylladb/scylladb#29059
* github.com:scylladb/scylladb:
sstables: Fix object storage lister not resetting position in batch vector
sstables: Fix object storage lister skipping entries when filter is active
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.
New functionality. No backport needed.
This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.
Closesscylladb/scylladb#28960
* github.com:scylladb/scylladb:
db/config: announce ms format as highest supported
db/config: enable `ms` sstable format by default
cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
api/system: add /system/chosen_sstable_version
test/cluster/dtest: reduce num_tokens to 16
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:
- index entries don't store keys as separate small objects (managed_bytes)
They are written into one managed_bytes fragmented storage, entries
hold offset into it.
Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
the storage (1 byte) plus back-reference in the storage (8 bytes),
so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
bytes, that's a reduction from 31 bytes to 20 bytes per key.
- index entries and key storage are now trivially moveable, so LSA
migration can use memcpy() which amortizes the cost per key.
memcpy().
LSA eviction is now trivial and constant time for the whole page
regardless of the number of entries. Page eviction dropped from
14 us to 1 us.
This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
After:
21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors)
20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors)
20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors)
20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors)
20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors)
21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors)
21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors)
Mostly because the index now fits in memory.
When it doesn't, the benefits are still visible due to lower LSA overhead.
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.
For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.
promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.
Reducing the size of index_entry is important for performence if pages
are densly populated. It helps to reduce LSA allocator pressure and
compaction/eviction speed.
This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
9714.78 tps (170.9 allocs/op, 16.9 logallocs/op, 55.3 tasks/op, 494788 insns/op, 343920 cycles/op, 0 errors)
9603.13 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 502358 insns/op, 348344 cycles/op, 0 errors)
9621.43 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 500612 insns/op, 347508 cycles/op, 0 errors)
9597.75 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 501428 insns/op, 348604 cycles/op, 0 errors)
9615.54 tps (171.6 allocs/op, 16.9 logallocs/op, 55.6 tasks/op, 501313 insns/op, 347935 cycles/op, 0 errors)
9577.03 tps (171.8 allocs/op, 17.0 logallocs/op, 55.7 tasks/op, 503283 insns/op, 349251 cycles/op, 0 errors)
After:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
The std::optional<> adds 8 bytes.
And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.
The size changd from 56 to 48 bytes.
vector
The lister loop in get() pre-fetches records in batches and keeps them
in a _info vector, iterating over it with the help of _pos cursor. When
the vector is re-read, the cursor must be reset too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lister loop in get() method looks weird. It uses do-while(false)
loop and calls continue; inside when filter asks to skip a entry.
Skipping, thus, aborts the whole thing and EOF-s, which is not what's
supposed to happen.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode
Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.
Depends on https://github.com/scylladb/scylladb/pull/28338
Backport is not required, this is new feature
Fixes https://github.com/scylladb/scylladb/issues/20103Closesscylladb/scylladb#28761
* github.com:scylladb/scylladb:
test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
sstables: propagate ignore_component_digest_mismatch config to all load sites
sstables: add option to ignore component digest mismatches
sstable_compaction_test: Add scrub validate test for corrupted index
sstables: add tests for component digest validation on corrupted SSTables
sstables: validate index components digests during SSTable scrub in validate mode
sstables: verify component digests on SSTable load
sstables: add digest_file_random_access_reader for CRC32 digest computation
Introduced by 54bddeb3b5, the yield was
added to write_cell(), to also help the general case where there is no
collection. Arguably this was unnecessary and this patch moves the yield
to write_collection(), to the cell write loop instead, so regular cells
don't have to poll the preempt flag.
Closesscylladb/scylladb#29013
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.
Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
Add integrity verification for SSTable component files by validating
their CRC32 digests against the expected values stored in Scylla
metadata during SSTable loading.
The following components are validated on load: TOC, Scylla metadata,
CompressionInfo, Statistics, Summary, and Filter.
Add a new reader that wraps file_random_access_reader and computes a
running CRC32 digest over the data as it is read. The digest accumulates
across sequential read_exactly() calls and is reset on seek(), since a
non-sequential read invalidates the running checksum.