scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	1c0f8ab66e	Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site. This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths: - Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`) - Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`) - `parse_assert()` failures (via `on_parse_error()`) - BTI parse errors (via `on_bti_parse_error()`) The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure. The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption. Commit breakdown: 1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc` 2. `on_parse_error()` and `on_bti_parse_error()` check the new flag 3. All ~50 `throw malformed_sstable_exception(...)` sites migrated 4. 
Both `throw bufsize_mismatch_exception(...)` sites migrated Refs: SCYLLADB-1087 Backport: new feature, no backport Closes scylladb/scylladb#29324 * github.com:scylladb/scylladb: sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose sstables: introduce --abort-on-malformed-sstable-error infrastructure sstables: refactor parse_path() to return std::expected<> instead of throwing	2026-05-12 12:38:25 +03:00
Dimitrios Symonidis	94bc0245f9	sstables, utils/s3: reuse caller-provided file in s3_storage::make_source s3_storage::make_source previously ignored its file f parameter and constructed a fresh s3::client::readable_file per call. The new file's _stats cache was empty, so the first dma_read_bulk issued a HEAD via maybe_update_stats just to learn the object size before the ranged GET -- one ~50 ms RTT per uncached read. The file f passed in by the two callers (sstable::data_stream for Data.db reads and index_reader::make_context for Index.db reads) already wraps the sstable's _data_file or _index_file. Those file objects had their stats populated at sstable open time by update_info_for_opened_data, and they were wrapped with the configured file_io_extensions when opened via open_component. Reusing them is exactly what filesystem_storage::make_source does (one-line make_file_data_source over f), so the s3 path simply matches it. readable_file::size() is also updated to route through maybe_update_stats(), so a .size() call populates the _stats cache the same way .stat() does -- preventing a redundant HEAD on the first subsequent read of components opened with .size() (Index, Partitions, Rows in update_info_for_opened_data). Closes scylladb/scylladb#29766 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 12:38:24 +03:00
Botond Dénes	cf37f541a0	Merge ' sstables_loader: ensure upload directory is empty when load_and_stream returns' from Taras Veretilnyk After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`) returns success, source sstable files in the `upload/` directory may still be on disk. `mark_for_deletion()` only sets an in-memory flag; the actual file deletion runs lazily when the last `shared_sstable` reference drops. This leaves a window between API success and physical deletion where a follow-up scan of the upload directory can detect sstables that are about to be deleted. Processing them can then fail because the sstable is wiped while it is being processed. The fix: force the unlink to complete before `stream()` returns, so the upload directory is in a consistent state by the time the API reports success. For tablet streaming, partially-contained sstables participate in multiple per-tablet batches; eagerly unlinking after each batch would break the next batch that still needs to read the file. A `defer_unlinking` flag on the streamer postpones the explicit unlink until after all batches complete (called once at the end of `tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of `stream_sstable_mutations`. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647 Backport is required, as it is a bug fix for a bug that was introduced in `517a4dc4df`. Closes scylladb/scylladb#29599 * github.com:scylladb/scylladb: sstables_loader: synchronously unlink streamed sstables before returning sstables: make sstable::unlink() idempotent	2026-05-11 14:43:46 +03:00
Botond Dénes	ad7ac62835	Merge ' Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562 No need to backport this, keyspace over object storage is an experimental feature Closes scylladb/scylladb#29659 * github.com:scylladb/scylladb: db, sstables: add node_owner to sstables registry primary key db, sstables: rename sstables registry column owner to table_id	2026-05-11 14:08:19 +03:00
Botond Dénes	2edfb91070	sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() Replace the two remaining direct 'throw bufsize_mismatch_exception(...)' call sites with the new throw_bufsize_mismatch_exception() helper, which routes through throw_malformed_sstable_exception() and thus also respects the --abort-on-malformed-sstable-error flag. Affected files: - sstables/sstables.cc (1 site, in check_buf_size()) - sstables/m_format_read_helpers.cc (1 site, in check_buf_size())	2026-05-11 11:58:14 +03:00
Botond Dénes	d65c1523c2	sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() Replace all direct 'throw malformed_sstable_exception(...)' call sites with the new throw_malformed_sstable_exception() helper, which respects the --abort-on-malformed-sstable-error flag.	2026-05-11 11:58:14 +03:00
Botond Dénes	84c27658d9	sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error Both functions now check abort_on_malformed_sstable_error() first. If set, they log the error and call std::abort() directly, generating a coredump. Otherwise they fall through to the existing on_internal_error() path, which is in turn controlled by --abort-on-internal-error.	2026-05-11 11:58:14 +03:00
Botond Dénes	4ebcc002d6	sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after seastar::testing::scoped_no_abort_on_internal_error) and use it in all tests that intentionally corrupt sstables and expect malformed_sstable_exception to be thrown rather than the process aborting.	2026-05-11 11:58:14 +03:00
Botond Dénes	f6dc2cb5f8	sstables: introduce --abort-on-malformed-sstable-error infrastructure Add the --abort-on-malformed-sstable-error command-line option and the supporting infrastructure. When set, any malformed sstable error will abort the process and generate a coredump instead of throwing an exception. This is useful for debugging memory corruption that may manifest as apparent sstable corruption. The implementation introduces: - throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception() helper functions in sstables/sstables.cc, which check the new flag and either abort (with logging) or throw the appropriate exception. - set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error() to control the per-process atomic flag. - abort_on_malformed_sstable_error config option (LiveUpdate, default false) wired up in main.cc alongside abort_on_internal_error. Call-site migration will follow in subsequent commits.	2026-05-11 11:58:14 +03:00
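The abort-or-throw helper described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from `sstables/sstables.cc`: the names `malformed_sstable_error`, `set_abort_on_malformed()`, `get_abort_on_malformed()` and `throw_malformed()` are hypothetical stand-ins, and logging is reduced to a write to `std::cerr`.

```cpp
#include <atomic>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real malformed-sstable exception type.
struct malformed_sstable_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Per-process flag. It defaults to false, so errors are thrown as before.
static std::atomic<bool> abort_on_malformed{false};

void set_abort_on_malformed(bool value) {
    abort_on_malformed.store(value, std::memory_order_relaxed);
}

bool get_abort_on_malformed() {
    return abort_on_malformed.load(std::memory_order_relaxed);
}

// Single choke point for reporting a parse error. When the flag is set,
// the process aborts here, while the stack still points at the failing
// call site. Otherwise the usual exception is thrown.
[[noreturn]] void throw_malformed(const std::string& msg) {
    if (get_abort_on_malformed()) {
        std::cerr << "malformed sstable: " << msg << std::endl;
        std::abort();
    }
    throw malformed_sstable_error(msg);
}
```

Routing every throw site through one helper is what lets a single flag change the behaviour of all of them at once.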
Botond Dénes	c3daa6379c	sstables: refactor parse_path() to return std::expected<> instead of throwing make_entry_descriptor() and the two overloads of parse_path() used to signal parse failures by throwing malformed_sstable_exception, which made parse_path() expensive to use as a probe (e.g. to classify directory entries). Change make_entry_descriptor() and both parse_path() overloads to return std::expected<T, sstring>, where the sstring carries the error message on failure, eliminating the exception overhead at probe call sites. Call sites that previously caught malformed_sstable_exception to treat the path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc, tools/scylla-sstable.cc) now check the expected result directly. Call sites where a parse failure is a genuine error (sstable_directory.cc, sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw explicitly as malformed_sstable_exception using the error string, preserving the existing error propagation behaviour.	2026-05-11 11:58:14 +03:00
Yaniv Kaul	e29f59347b	sstables: fix missing format placeholders in error messages Fix three format string bugs: - partition_reversing_data_source.cc: _row_start was passed as an argument but had no {} placeholder in the invariant error message. Add {} for all three values to show the full diagnostic. - reader.cc: two "Invalid boundary type" error messages passed the type value as an argument but had no {} placeholder, so the actual invalid type was never shown. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2026-05-10 17:51:19 +03:00
Botond Dénes	d0813769ec	sstables/trie: add preemption points in trie_writer The BTI partition index trie writer flushes all buffered nodes at the end of each SSTable via complete_until_depth(0), called from bti_partition_index_writer_impl::finish(). This is a tight synchronous loop that writes trie nodes through file_writer::write(), which uses a buffered output_stream: individual writes that fit in the buffer are plain memcpy operations returning a ready future, so .get() never yields. As a result the reactor can stall for several milliseconds on large SSTables. The entire call chain runs inside seastar::async() (via sstable::write_components()), so seastar::thread::maybe_yield() is safe to call here. Add it at the top of both tight loops: - complete_until_depth(), which iterates over trie depth - lay_out_children(), which iterates over child branches per node Fixes SCYLLADB-1885 Closes scylladb/scylladb#29798	2026-05-10 11:30:59 +03:00
Łukasz Paszkowski	7e14ea5ac8	sstables: only wipe TemporaryHashes for sstable formats that have it Commit `8d34127684` ("sstables: clean up TemporaryHashes file in wipe()") unconditionally calls filename(..., component_type::TemporaryHashes) inside filesystem_storage::wipe(). However, the TemporaryHashes component is only registered in the component map of the 'ms' sstable format. For older formats (ka, la, mc, md, me) the lookup goes through sstable_version_constants::get_component_map(version).at(...) and throws std::out_of_range. The exception is then swallowed by the outer catch(...) in wipe(), which just logs and ignores. As a side effect, the subsequent remove_file(new_toc_name) is never reached and the TemporaryTOC ('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink() of a non-'ms' sstable. Guard the lookup with get_component_map(version).contains() so the cleanup is only attempted for formats that actually define the component. Add a regression test in test/boost/sstable_directory_test.cc that creates an 'me'-format sstable, unlinks it and asserts that the sstable directory is left empty. Without the fix the test fails with a leftover 'me-...-TOC.txt.tmp' file. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697 Closes scylladb/scylladb#29620	2026-04-29 08:06:36 +03:00
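The guarded lookup described in this commit can be illustrated with a small sketch. The types below (`component`, `component_map`) and the helper `filename_if_defined()` are hypothetical simplifications, not the ScyllaDB definitions. The sketch uses `find()` where the commit uses `contains()`, which amounts to the same check.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified component identifiers.
enum class component { TOC, Data, TemporaryHashes };

// Hypothetical per-format map from component to file name suffix.
using component_map = std::map<component, std::string>;

// Returns the suffix only if the format defines the component.
// An unguarded m.at(c) would throw std::out_of_range instead, and a
// surrounding catch-all could then skip any cleanup that follows.
std::optional<std::string> filename_if_defined(const component_map& m, component c) {
    auto it = m.find(c);
    if (it == m.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

A caller would attempt the removal only when a value is returned, and continue with the remaining cleanup steps either way.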
Taras Veretilnyk	784127c40b	sstables_loader: synchronously unlink streamed sstables before returning mark_for_deletion() only set an in-memory flag; the actual file deletion ran lazily when the last shared_sstable reference dropped, leaving a window in which a follow-up scan of the upload directory (e.g. a second 'nodetool refresh --load-and-stream') could observe a partially-deleted sstable and fail with malformed_sstable_exception. Force the unlink to complete before stream() returns. For tablet streaming, partially-contained sstables span multiple per-tablet batches, so a defer_unlinking flag postpones the unlink until after all sstables are streamed; with vnodes, and for fully-contained sstables, each sstable is streamed only once and can be removed right after being streamed. Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to make the never-fails contract explicit.	2026-04-28 14:52:28 +02:00
Dimitrios Symonidis	c40842f60a	db, sstables: add node_owner to sstables registry primary key Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. The new column is populated via sstables_manager::get_local_host_id(). No backward compatibility is preserved; the feature is experimental and gated by keyspace-storage-options.	2026-04-24 16:41:09 +02:00
Dimitrios Symonidis	ce78c5113e	db, sstables: rename sstables registry column owner to table_id The partition-key column in system.sstables named 'owner' actually holds a table_id. Rename the CQL column and the matching C++ parameter and member names so the identifier describes what it stores. No behavior change. This prepares the schema for an upcoming node_owner partition-key column (the local host id), which needs a free name.	2026-04-24 16:24:07 +02:00
Taras Veretilnyk	7cdf215999	sstables: make sstable::unlink() idempotent Avoid duplicate work when unlink() is called more than once on the same sstable. This happens when a caller invokes unlink() explicitly on an sstable that is also marked for deletion: the destructor's close_files() path would otherwise call unlink() again, re-firing _on_delete, double-counting _stats.on_delete() and double-invoking _manager.on_unlink().	2026-04-21 22:41:02 +02:00
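A minimal sketch of the idempotency guard described in this commit, assuming a hypothetical `tracked_file` class with a single deletion callback. The real `sstable::unlink()` does considerably more work, and none of these names come from the ScyllaDB source.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for an object whose destructor unlinks it
// when it has been marked for deletion.
class tracked_file {
    bool _marked_for_deletion = false;
    bool _unlinked = false;
    std::function<void()> _on_delete;
public:
    explicit tracked_file(std::function<void()> on_delete)
        : _on_delete(std::move(on_delete)) {}

    void mark_for_deletion() { _marked_for_deletion = true; }

    // Safe to call more than once: only the first call does the work
    // and fires the callback.
    void unlink() {
        if (_unlinked) {
            return;
        }
        _unlinked = true;
        if (_on_delete) {
            _on_delete();
        }
    }

    // The destructor path may run after an explicit unlink() call.
    ~tracked_file() {
        if (_marked_for_deletion) {
            unlink();
        }
    }
};
```

With the guard in place, an explicit `unlink()` followed by destruction of a marked object fires the callback once rather than twice.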
Botond Dénes	cfebe17592	sstables: fix segfault in parse_assert() when message is nullptr parse_assert() accepts an optional `message` parameter that defaults to nullptr. When the assertion fails and message is nullptr, it is implicitly converted to sstring via the sstring(const char*) constructor, which calls strlen(nullptr) -- undefined behavior that manifests as a segfault in __strlen_evex. This turns what should be a graceful malformed_sstable_exception into a fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered parse_assert() during streaming (in continuous_data_consumer:: fast_forward_to()), causing a crash loop on the affected node. Fix by guarding the nullptr case with a ternary, passing an empty sstring() when message is null. on_parse_error() already handles the empty-message case by substituting "parse_assert() failed". Fixes: SCYLLADB-1329 Closes scylladb/scylladb#29285	2026-04-21 12:40:33 +02:00
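The null-pointer guard described in this commit can be sketched as below, using `std::string` in place of `sstring`. The helper names `message_or_empty()` and `describe_parse_failure()` are hypothetical. Only the fallback text "parse_assert() failed" is taken from the commit message.

```cpp
#include <string>

// Constructing a string from a null const char* is undefined behaviour,
// so the null case is mapped to an empty string explicitly.
std::string message_or_empty(const char* message) {
    return message ? std::string(message) : std::string();
}

// Mirrors the described fallback: an empty message is replaced with a
// generic description of the failure.
std::string describe_parse_failure(const char* message = nullptr) {
    std::string msg = message_or_empty(message);
    return msg.empty() ? std::string("parse_assert() failed") : msg;
}
```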
Botond Dénes	69c58c6589	Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage. The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it. This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901 The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions. Closes scylladb/scylladb#28873 * github.com:scylladb/scylladb: streaming: reject mutation fragments on critical disk utilization test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection sstables: clean up TemporaryHashes file in wipe() sstables: add error injection point in write_components test/cluster/storage: extract validate_data_existence to module scope test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity utils/disk_space_monitor: add error injection to suppress threshold checks	2026-04-20 17:56:36 +03:00
Botond Dénes	57f8be49e9	Merge 'Move ignore_component_digest_mismatch flag on sstables_manager' from Pavel Emelyanov The PR serves two purposes. First, it makes the flag usage be consistent across multiple ways to load sstables components. For example, the sstable::load_metadata() doesn't set it (like .load() does) thus potentially refusing to load "corrupted" components, as the flag assumes. Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This thing is called pretty much everywhere to initialize the sstable_open_config, while the option in question is "scylla state" parameter, not "sstable opening" one. Code cleanup, not backporting Closes scylladb/scylladb#29513 * github.com:scylladb/scylladb: sstables: Remove ignore_component_digest_mismatch from sstable_open_config sstables: Move ignore_component_digest_mismatch initialization to constructor sstables: Add ignore_component_digest_mismatch to sstables_manager config	2026-04-17 12:54:17 +03:00
Łukasz Paszkowski	4657d9e32c	streaming: reject mutation fragments on critical disk utilization The stream_mutation_fragments RPC handler did not check is_in_critical_disk_utilization_mode before accepting incoming mutation fragments. This meant load-and-stream (nodetool refresh --load-and-stream) could push data onto a node at critical disk utilization, potentially filling the disk completely. Add a critical disk utilization check in the get_next_mutation_fragment lambda, throwing critical_disk_utilization_exception when the node is in critical mode. This mirrors the existing protection in stream_blob.cc. Also remove the xfail marker from the corresponding test added in the previous commit.	2026-04-17 09:31:26 +02:00
Botond Dénes	88a8324e68	Merge 'db: store large data records in SSTable metadata and serve via virtual tables' from Benny Halevy `system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables. This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades. When the cluster feature is enabled, each node drops the old system large_ tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables. Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records. 1. keys: move key_to_str() to keys/keys.hh — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable 2. sstables: add LargeDataRecords metadata type (tag 13) — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation 3. large_data_handler: rename partition_above_threshold to above_threshold_result — generalize the struct for reuse 4. large_data_handler: return above_threshold_result from maybe_record_large_cells — separate booleans for cell size vs collection elements thresholds 5. sstables: populate LargeDataRecords from writer — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable` 6. 
test: add LargeDataRecords round-trip unit tests — verify write/read, top-N bounding, below-threshold behavior 7. db: call initialize_virtual_tables from shard 0 only — preparatory refactoring to enable cross-shard coordination 8. db: implement large_data virtual tables with feature flag gating — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276 * Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport Closes scylladb/scylladb#29257 * github.com:scylladb/scylladb: db: implement large_data virtual tables with feature flag gating db: call initialize_virtual_tables from shard 0 only test: add LargeDataRecords round-trip unit tests sstables: populate LargeDataRecords from writer large_data_handler: return above_threshold_result from maybe_record_large_cells large_data_handler: rename partition_above_threshold to above_threshold_result sstables: add LargeDataRecords metadata type (tag 13) sstables: add fmt::formatter for large_data_type keys: move key_to_str() to keys/keys.hh	2026-04-16 14:03:31 +03:00
Pavel Emelyanov	4d352c7cf5	sstables: Remove ignore_component_digest_mismatch from sstable_open_config The ignore_component_digest_mismatch flag is now initialized at sstable construction time from sstables_manager::config (which is populated from db::config at boot time). Remove the flag from sstable_open_config struct and all call sites that were setting it explicitly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:49:14 +03:00
Pavel Emelyanov	9107e055b3	sstables: Move ignore_component_digest_mismatch initialization to constructor Initialize the ignore_component_digest_mismatch flag from sstables_manager::config in the sstable constructor initializer list instead of in load(). This ensures the flag value is set at construction time when the manager config is available, rather than at load time. Mark the member const to reflect its immutability after construction. Fixes the bootstrap path which now correctly reads the flag from manager config initialized from db::config at boot time, instead of using the default value. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:49:00 +03:00
Pavel Emelyanov	8abfd9af00	sstables: Add ignore_component_digest_mismatch to sstables_manager config Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config during database initialization. This makes the flag available early in the boot process, before SSTables are loaded, enabling later commits to move the flag initialization from load-time to construction-time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:48:49 +03:00
Łukasz Paszkowski	8d34127684	sstables: clean up TemporaryHashes file in wipe() The TemporaryHashes.db.tmp file is created during SSTable writing to store intermediate bloom filter hashes and is deleted before the SSTable is sealed. Since it is not tracked in the TOC, it is also absent from _recognized_components and all_components(). When an SSTable write fails before sealing (e.g. streaming rejected due to critical disk utilization), wipe() is called to clean up the partial SSTable. However, wipe() only iterates over all_components(), so the TemporaryHashes file was left behind as an orphan. Previously, the only cleanup mechanism for this file was the startup-time directory scanner in sstable_directory, which would not help when the orphan needs to be cleaned up at runtime. Explicitly remove the TemporaryHashes file in wipe(), ignoring ENOENT for the common case where the file was already removed before sealing.	2026-04-16 08:38:34 +02:00
Łukasz Paszkowski	159675e975	sstables: add error injection point in write_components Add a `write_components_writer_created` error injection point in `sstable::write_components()` between writer creation and fragment consumption. This injection is needed by the out-of-space streaming test (added in the next patch) to reliably pause SSTable writing at the right moment: after the SSTable writer has been created and files exist on disk, but before mutation fragments are consumed. Pausing earlier (before writer creation) would not work because there are no files on disk yet, while pausing later (after consuming fragments) would be too late to reliably push the node into critical disk utilization.	2026-04-16 08:38:34 +02:00
Benny Halevy	1f7faeef57	sstables: populate LargeDataRecords from writer During compaction (SSTable writing), maintain bounded min-heaps (one per large_data_type) that collect the top-N above-threshold records. On stream end, drain all five heaps into a single LargeDataRecords array and write it into the SSTable's scylla metadata component. Five separate heaps are used: - partition_size, row_size, cell_size: ordered by value (size bytes) - rows_in_partition, elements_in_collection: ordered by elements_count A new config option 'compaction_large_data_records_per_sstable' (default 10) controls the maximum number of records kept per type.	2026-04-16 08:49:02 +03:00
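The bounded min-heap technique named in this commit can be sketched as follows. This is an illustration under simplifying assumptions: records are reduced to a single `uint64_t` value, and the class name `top_n_tracker` is hypothetical. The actual writer keeps one such heap per large_data_type and stores full records.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Keeps the N largest values seen so far. The heap is a min-heap, so
// its top is the smallest retained value, which is the one to evict
// when a larger value arrives.
class top_n_tracker {
    std::size_t _limit;
    std::priority_queue<std::uint64_t,
                        std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>> _heap;
public:
    explicit top_n_tracker(std::size_t limit) : _limit(limit) {}

    void record(std::uint64_t value) {
        if (_limit == 0) {
            return;
        }
        if (_heap.size() < _limit) {
            _heap.push(value);
        } else if (value > _heap.top()) {
            _heap.pop();
            _heap.push(value);
        }
    }

    // Empties the heap and returns the retained values in ascending order.
    std::vector<std::uint64_t> drain() {
        std::vector<std::uint64_t> out;
        while (!_heap.empty()) {
            out.push_back(_heap.top());
            _heap.pop();
        }
        return out;
    }
};
```

Memory stays bounded by N per heap regardless of how many records are offered, and each `record()` call costs O(log N) at most.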
Benny Halevy	8f4976f65d	large_data_handler: return above_threshold_result from maybe_record_large_cells Change maybe_record_large_cells to return above_threshold_result with separate booleans for cell size (.size) and collection elements (.elements) thresholds. This allows the writer to track above_threshold counts for cell_size and elements_in_collection independently.	2026-04-16 08:49:02 +03:00
Benny Halevy	c1b797f288	large_data_handler: rename partition_above_threshold to above_threshold_result Rename partition_above_threshold to above_threshold_result and its 'rows' field to 'elements', making it a generic struct that can be reused for other large data types (e.g., cells with collection elements). Use designated initializers for clarity.	2026-04-16 08:49:02 +03:00
Benny Halevy	d92cd42fe6	sstables: add LargeDataRecords metadata type (tag 13) Add a new scylla metadata component LargeDataRecords (tag 13) that stores per-SSTable top-N large data records. Each record carries: - large_data_type (partition_size, row_size, cell_size, etc.) - binary serialized partition key and clustering key - column name (for cell records) - value (size in bytes) - element count (rows or collection elements, type-dependent) - range tombstones and dead rows (partition records only) The struct uses disk_string<uint32_t> for key/name fields and is serialized via the existing describe_type framework into the SSTable Scylla metadata component. Add JSON support in scylla-sstable and format documentation.	2026-04-16 08:49:01 +03:00
Benny Halevy	85e2c6f2a7	sstables: add fmt::formatter for large_data_type Add a fmt::formatter specialization for sstables::large_data_type and use it in scylla-sstable.cc instead of the local to_string() overload, which is removed.	2026-04-16 08:42:54 +03:00
Dimitrios Symonidis	24a7b146fa	sstables/utils/s3: split config update into sync and async parts Config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases: Sync (runs in the observer, never suspends): - S3: atomically swap _cfg (lw_shared_ptr) and set a credentials refresh flag. - GCS: install a freshly constructed client; stash the old one for async cleanup. - storage_manager: update _object_storage_endpoints and fire the async cleanup via a gate-guarded background fiber. Async (gate-guarded background fiber): - S3: acquire _creds_sem, invalidate and rearm credentials only if the refresh flag is set. - GCS: drain and close stashed old clients.	2026-04-15 14:28:31 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Pavel Emelyanov	e0fa9ee332	Merge 'storage: implement sstable clone for object storage' from Ernest Zaslavsky This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction. The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed). The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path: - GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body. - S3 `copy_object` silently ignored the abort_source parameter. 1. gcp_client: fix copy_object request method and body — Fix two bugs in the GCS rewrite API call. 2. s3_client: pass through abort_source in copy_object — Stop ignoring the abort_source parameter. 3. object_storage: add copy_object to object_storage_client — New interface method with S3 and GCS implementations. 4. storage: add make_object_name overload with generation — Helper for building destination object names with a different generation. 5. storage: make delete_object const — Needed by the const clone method. 6. storage: implement object_storage_base::clone — The actual clone implementation plus a copy_object wrapper. 7. test/boost: enable sstable clone tests for S3 and GCS — Re-enable the previously skipped tests. A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both S3 and GCS. 
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045 Prerequisite: https://github.com/scylladb/scylladb/pull/28790 No need to backport since this code targets future feature Closes scylladb/scylladb#29166 * github.com:scylladb/scylladb: compaction_test: enable sstable clone tests for S3 and GCS storage: implement object_storage_base::clone storage: make delete_object const in object_storage_base storage: add make_object_name overload with generation sstables: add get_format() accessor to sstable object_storage: add copy_object to object_storage_client s3_client: pass through abort_source in copy_object gcp_client: fix copy_object request method and body	2026-04-08 09:35:10 +03:00
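The clone ordering described in this merge (marker first, copies second, marker removal as the commit point) can be sketched as follows. This is only an illustration under stated assumptions: `object_store`, `object_name` and `clone_sstable` are hypothetical names, and an in-memory map stands in for the S3/GCS server-side copy calls; none of this is the actual `object_storage_base::clone` code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Assumption: an in-memory map stands in for a bucket (object name -> content).
using object_store = std::map<std::string, std::string>;

// Hypothetical naming scheme: "<generation>/<component>".
std::string object_name(int generation, const std::string& component) {
    return std::to_string(generation) + "/" + component;
}

// Copy all components of src_gen to dst_gen, following the order from the
// commit message: write TemporaryTOC, copy components, remove TemporaryTOC.
void clone_sstable(object_store& store, int src_gen, int dst_gen,
                   const std::vector<std::string>& components,
                   bool leave_unsealed) {
    // 1. Mark the operation as in-progress.
    store[object_name(dst_gen, "TemporaryTOC")] = "";
    // 2. Copy every component object (a real store would do this server-side).
    for (const auto& c : components) {
        store[object_name(dst_gen, c)] = store.at(object_name(src_gen, c));
    }
    // 3. Commit by removing the marker, unless the caller wants it unsealed.
    if (!leave_unsealed) {
        store.erase(object_name(dst_gen, "TemporaryTOC"));
    }
}
```

If the process dies between steps 1 and 3, the leftover marker tells a later reader that the destination generation is incomplete.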
Ernest Zaslavsky	7cd9bbb010	storage: implement object_storage_base::clone Implement the clone method for object_storage_base, which creates a copy of an sstable with a new generation using server-side object copies. Also add a const copy_object convenience wrapper, similar to the existing put_object and delete_object wrappers. A dedicated test for the new object storage clone path will be added in the following commit. The preexisting local-filesystem clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable test.	2026-04-07 18:16:52 +03:00
Ernest Zaslavsky	8fa82e6b6f	storage: make delete_object const in object_storage_base The method doesn't modify any member state. Making it const is needed for calling it from the const clone method.	2026-04-07 18:16:52 +03:00
Ernest Zaslavsky	47387341bb	storage: add make_object_name overload with generation Add a make_object_name overload that accepts a target generation parameter for constructing object names with a generation different from the source sstable's own. Refactor the original make_object_name to delegate to the new overload, eliminating code duplication. This is needed by clone to build destination object names for the new generation.	2026-04-07 18:16:52 +03:00
Ernest Zaslavsky	8bd891c6ed	sstables: add get_format() accessor to sstable Add a public get_format() accessor for the _format member, following the same pattern as the existing get_version(). This allows storage implementations to access the sstable format without reaching into private members, and is needed by the upcoming object_storage_base::clone to construct entry_descriptor for the sstables registry.	2026-04-07 18:16:52 +03:00
Ernest Zaslavsky	3d23490615	object_storage: add copy_object to object_storage_client Add a copy_object method to the object_storage_client interface for server-side object copies, with implementations for both S3 and GCS wrappers. The S3 wrapper delegates to s3::client::copy_object. The GCS wrapper delegates to gcp::storage::client's cross-bucket copy_object overload. This is a prerequisite for implementing sstable clone on object storage.	2026-04-07 18:16:52 +03:00
Pavel Emelyanov	58e59e8c0d	Merge 'test: add test_sstable_clone_preserves_staging_state' from Benny Halevy Add a test that verifies filesystem_storage::clone preserves the sstable state: an sstable in staging is cloned to a new generation, the clone is re-loaded from the staging directory, and its state is asserted to still be staging. The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205 is invalid, and can be closed. * No functional change and no backport needed Closes scylladb/scylladb#29209 * github.com:scylladb/scylladb: test: add test_sstable_clone_preserves_staging_state test: derive sstable state from directory in test_env::make_sstable sstables: log debug message in filesystem_storage::clone	2026-04-07 17:02:04 +03:00
Ernest Zaslavsky	8f6630e9cd	storage: add `exists` method to storage abstraction Add an `exists` method to the storage abstraction to allow S3, GCS, and local storage implementations to check whether an sstable component is present.	2026-04-05 11:07:17 +03:00
Botond Dénes	2d2ff4fbda	sstables: use chunked_managed_vector for promoted indexes in partition_index_page Switch _promoted_indexes storage in partition_index_page from managed_vector to chunked_managed_vector to avoid large contiguous allocations. Avoid allocation failure (or crashes with --abort-on-internal-error) when large partitions have enough promoted index entries to trigger a large allocation with managed_vector. Fixes: SCYLLADB-1315 Closes scylladb/scylladb#29283	2026-03-31 18:43:57 +03:00
Benny Halevy	ca9ff134b8	sstables: log debug message in filesystem_storage::clone	2026-03-24 12:26:03 +02:00
Pavel Emelyanov	c4a0f6f2e6	object_store: Don't leave dangling objects by iterating moved-from names vector The code in upload_file std::move()-s vector of names into merge_objects() method, then iterates over this vector to delete objects. The iteration is apparently a no-op on moved-from vector. The fix is to make merge_objects() helper get vector of names by const reference -- the method doesn't modify the names collection, the caller keeps one in stable storage. Fixes #29060 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29061	2026-03-20 10:09:30 +02:00
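A minimal sketch of the bug class fixed above, assuming hypothetical helper names (`merge_objects_by_value`, `merge_objects_by_cref`) in place of the real `merge_objects()` signature, which is not shown in this log. Moving a `std::vector` into a by-value parameter leaves the caller's vector empty, so a later cleanup loop over it does nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pre-fix shape: the helper takes ownership of the names.
std::size_t merge_objects_by_value(std::vector<std::string> names) {
    return names.size();
}

// Hypothetical post-fix shape: the helper only reads the names.
std::size_t merge_objects_by_cref(const std::vector<std::string>& names) {
    return names.size();
}

// Returns how many objects the cleanup loop visits before the fix.
std::size_t cleanup_count_before_fix() {
    std::vector<std::string> names{"part-0", "part-1", "part-2"};
    merge_objects_by_value(std::move(names));
    std::size_t deleted = 0;
    // names was move-constructed from, so it is empty and the loop is a no-op.
    for (const auto& name : names) {
        (void)name;
        ++deleted;
    }
    return deleted;
}

// Returns how many objects the cleanup loop visits after the fix.
std::size_t cleanup_count_after_fix() {
    std::vector<std::string> names{"part-0", "part-1", "part-2"};
    merge_objects_by_cref(names);
    std::size_t deleted = 0;
    // The caller still owns the names, so every object is visited.
    for (const auto& name : names) {
        (void)name;
        ++deleted;
    }
    return deleted;
}
```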
Botond Dénes	97430e2df5	Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov Two issues were found in the lister returned by gs_client_wrapper::make_object_lister(): the lister can report EOF too early when a filter is active, and there is a potential out-of-bounds vector access. Fixes #29058 The code appeared in 2026.1, worth fixing it there as well Closes scylladb/scylladb#29059 * github.com:scylladb/scylladb: sstables: Fix object storage lister not resetting position in batch vector sstables: Fix object storage lister skipping entries when filter is active	2026-03-20 09:12:42 +02:00
Avi Kivity	062751fcec	Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make the new format a new default for new clusters by naming ms in the default scylla.yaml. New functionality. No backport needed. This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix. Closes scylladb/scylladb#28960 * github.com:scylladb/scylladb: db/config: announce ms format as highest supported db/config: enable `ms` sstable format by default cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format api/system: add /system/chosen_sstable_version test/cluster/dtest: reduce num_tokens to 16	2026-03-19 18:19:01 +02:00
Tomasz Grabiec	4410e9c61a	sstables: mx: index_reader: Optimize parsing for no promoted index case It's a common case with small partition workloads.	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	f55bb154ec	sstables: mx: index_reader: Amortize partition key storage This change reduces the cost of partition index page construction and LSA migration. This is achieved by several things working together: - index entries don't store keys as separate small objects (managed_bytes). They are written into one managed_bytes fragmented storage, and entries hold an offset into it. Before, we paid 16 bytes for managed_bytes plus LSA descriptor for the storage (1 byte) plus back-reference in the storage (8 bytes), so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16 bytes, that's a reduction from 31 bytes to 20 bytes per key. - index entries and key storage are now trivially moveable, so LSA migration can use memcpy(), which amortizes the cost per key. LSA eviction is now trivial and constant time for the whole page regardless of the number of entries. Page eviction dropped from 14 us to 1 us. This improves throughput in a CPU-bound miss-heavy read workload where the partition index doesn't fit in memory. 
scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors) After: 21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors) 20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors) 20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors) 20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors) 20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors) 21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors) 21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors) Mostly because the index now fits in memory. When it doesn't, the benefits are still visible due to lower LSA overhead.	2026-03-18 16:25:21 +01:00
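The storage layout described in this commit (all keys in one buffer, each entry holding only a 4-byte offset) can be sketched roughly as below. This is an assumption-laden illustration: `entry` and `key_page` are hypothetical names, and a plain `std::string` stands in for the fragmented `managed_bytes` storage that the real index page keeps under LSA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical index entry: only a 4-byte offset marking where its key ends.
struct entry {
    std::uint32_t key_end;
};
// Trivially copyable entries can be relocated with memcpy().
static_assert(std::is_trivially_copyable_v<entry>);

// Hypothetical page: one shared key buffer instead of one allocation per key.
class key_page {
    std::string _keys;            // all keys, back to back
    std::vector<entry> _entries;  // one small fixed-size record per key
public:
    void add(std::string_view key) {
        _keys.append(key);
        _entries.push_back(entry{static_cast<std::uint32_t>(_keys.size())});
    }
    std::size_t size() const { return _entries.size(); }
    // A key starts where the previous one ends (or at 0 for the first key).
    std::string_view key(std::size_t i) const {
        std::uint32_t begin = (i == 0) ? 0 : _entries[i - 1].key_end;
        return std::string_view(_keys).substr(begin, _entries[i].key_end - begin);
    }
};
```

Dropping the whole page then frees two buffers regardless of how many keys it holds, which is the property the commit credits for constant-time eviction.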
Tomasz Grabiec	50dc7c6dd8	sstables: mx: index_reader: Keep promoted_index info next to index_entry Densely populated pages have no promoted index (small partitions), so we can save space in such workloads by keeping promoted index in a separate vector. For workloads which do have a promoted index, pages have only one partition. There aren't many such pages and they are long-lived, so the extra allocation of the vector is amortized. promoted_index class is removed, and replaced with equivalent parsed_promoted_index_entry for simplicity. Because it's removed, make_cursor() is moved into the index_reader class. Reducing the size of index_entry is important for performance if pages are densely populated. It helps to reduce LSA allocator pressure and improves compaction/eviction speed. This change, combined with the earlier change "Shave-off 16 bytes from index_entry by using raw_token", gives significant improvement in throughput in perf_simple_query run where the index doesn't fit in memory: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 9714.78 tps (170.9 allocs/op, 16.9 logallocs/op, 55.3 tasks/op, 494788 insns/op, 343920 cycles/op, 0 errors) 9603.13 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 502358 insns/op, 348344 cycles/op, 0 errors) 9621.43 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 500612 insns/op, 347508 cycles/op, 0 errors) 9597.75 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 501428 insns/op, 348604 cycles/op, 0 errors) 9615.54 tps (171.6 allocs/op, 16.9 logallocs/op, 55.6 tasks/op, 501313 insns/op, 347935 cycles/op, 0 errors) 9577.03 tps (171.8 allocs/op, 17.0 logallocs/op, 55.7 tasks/op, 503283 insns/op, 349251 cycles/op, 0 errors) After: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 
tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)	2026-03-18 16:25:20 +01:00


4093 Commits