scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-23 00:02:37 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	1c0f8ab66e	Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site. This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths: - Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`) - Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`) - `parse_assert()` failures (via `on_parse_error()`) - BTI parse errors (via `on_bti_parse_error()`) The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure. The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption. Commit breakdown: 1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc` 2. `on_parse_error()` and `on_bti_parse_error()` check the new flag 3. All ~50 `throw malformed_sstable_exception(...)` sites migrated 4. 
Both `throw bufsize_mismatch_exception(...)` sites migrated Refs: SCYLLADB-1087 Backport: new feature, no backport Closes scylladb/scylladb#29324 * github.com:scylladb/scylladb: sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose sstables: introduce --abort-on-malformed-sstable-error infrastructure sstables: refactor parse_path() to return std::expected<> instead of throwing	2026-05-12 12:38:25 +03:00
Botond Dénes	cf37f541a0	Merge ' sstables_loader: ensure upload directory is empty when load_and_stream returns' from Taras Veretilnyk After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`) returns success, source sstable files in the `upload/` directory may still be on disk. `mark_for_deletion()` only sets an in-memory flag; the actual file deletion runs lazily when the last `shared_sstable` reference drops. This leaves a window between API success and physical deletion where a follow-up scan of the upload directory can detect sstables that are about to be deleted. This can cause failures because the sstable may already be wiped while it is being processed. The fix: force unlink to complete before `stream()` returns, so the upload directory is in a consistent state by the time the API reports success. For tablet streaming, partially-contained sstables participate in multiple per-tablet batches; eagerly unlinking after each batch would break the next batch that still needs to read the file. A `defer_unlinking` flag on the streamer postpones the explicit unlink until after all batches complete (called once at the end of `tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of `stream_sstable_mutations`. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647 Backport is required, as it is a bug fix for a bug that was introduced in `517a4dc4df`. Closes scylladb/scylladb#29599 * github.com:scylladb/scylladb: sstables_loader: synchronously unlink streamed sstables before returning sstables: make sstable::unlink() idempotent	2026-05-11 14:43:46 +03:00
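The difference between eager and deferred unlinking described in the merge above can be illustrated with a small model. This is only a sketch under stated assumptions: `streamer_sketch`, `stream_batch()` and `unlink_pending()` are hypothetical names, the upload directory is modelled as a set of file names, and none of this is the actual `sstables_loader` code.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a streamer; not the real sstables_loader classes.
struct streamer_sketch {
    bool defer_unlinking = false;          // mirrors the flag named in the commit message
    std::set<std::string> on_disk;         // files currently present in upload/
    std::set<std::string> pending_unlink;  // files to remove once all batches are done

    // Streams one batch. Returns false if a file the batch needs is already gone.
    bool stream_batch(const std::vector<std::string>& batch) {
        for (const auto& file : batch) {
            if (on_disk.count(file) == 0) {
                return false;
            }
        }
        for (const auto& file : batch) {
            if (defer_unlinking) {
                pending_unlink.insert(file);   // remember, unlink later
            } else {
                on_disk.erase(file);           // eager unlink (vnode-style)
            }
        }
        return true;
    }

    // Called once after the last batch, before reporting success.
    void unlink_pending() {
        for (const auto& file : pending_unlink) {
            on_disk.erase(file);
        }
        pending_unlink.clear();
    }
};
```

With an sstable that spans two tablets, the eager variant fails on the second batch, while the deferred variant streams both batches and still leaves the directory empty before returning.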
Botond Dénes	2edfb91070	sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() Replace the two remaining direct 'throw bufsize_mismatch_exception(...)' call sites with the new throw_bufsize_mismatch_exception() helper, which routes through throw_malformed_sstable_exception() and thus also respects the --abort-on-malformed-sstable-error flag. Affected files: - sstables/sstables.cc (1 site, in check_buf_size()) - sstables/m_format_read_helpers.cc (1 site, in check_buf_size())	2026-05-11 11:58:14 +03:00
Botond Dénes	d65c1523c2	sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() Replace all direct 'throw malformed_sstable_exception(...)' call sites with the new throw_malformed_sstable_exception() helper, which respects the --abort-on-malformed-sstable-error flag.	2026-05-11 11:58:14 +03:00
Botond Dénes	84c27658d9	sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error Both functions now check abort_on_malformed_sstable_error() first. If set, they log the error and call std::abort() directly, generating a coredump. Otherwise they fall through to the existing on_internal_error() path, which is in turn controlled by --abort-on-internal-error.	2026-05-11 11:58:14 +03:00
Botond Dénes	4ebcc002d6	sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after seastar::testing::scoped_no_abort_on_internal_error) and use it in all tests that intentionally corrupt sstables and expect malformed_sstable_exception to be thrown rather than the process aborting.	2026-05-11 11:58:14 +03:00
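A scoped guard of the kind this commit describes could look roughly like the sketch below. The namespace `sketch` and the variable `abort_on_malformed` are assumptions made for illustration; only the guard's name is taken from the commit message, and the real guard operates on ScyllaDB's own flag.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

// Hypothetical process-wide flag; the real one lives in sstables/sstables.cc.
inline std::atomic<bool> abort_on_malformed{true};

// RAII guard: turns the flag off for the current scope and restores the
// previous value on destruction, so a test can corrupt sstables on purpose
// and still observe an exception instead of an abort.
class scoped_no_abort_on_malformed_sstable_error {
    bool _prev;
public:
    scoped_no_abort_on_malformed_sstable_error() noexcept
        : _prev(abort_on_malformed.exchange(false)) {}
    ~scoped_no_abort_on_malformed_sstable_error() {
        abort_on_malformed.store(_prev);
    }
    scoped_no_abort_on_malformed_sstable_error(
        const scoped_no_abort_on_malformed_sstable_error&) = delete;
    scoped_no_abort_on_malformed_sstable_error& operator=(
        const scoped_no_abort_on_malformed_sstable_error&) = delete;
};

} // namespace sketch
```

Restoring the saved value, rather than unconditionally setting the flag back to `true`, keeps nested guards and tests that start with the flag disabled well-behaved.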
Botond Dénes	f6dc2cb5f8	sstables: introduce --abort-on-malformed-sstable-error infrastructure Add the --abort-on-malformed-sstable-error command-line option and the supporting infrastructure. When set, any malformed sstable error will abort the process and generate a coredump instead of throwing an exception. This is useful for debugging memory corruption that may manifest as apparent sstable corruption. The implementation introduces: - throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception() helper functions in sstables/sstables.cc, which check the new flag and either abort (with logging) or throw the appropriate exception. - set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error() to control the per-process atomic flag. - abort_on_malformed_sstable_error config option (LiveUpdate, default false) wired up in main.cc alongside abort_on_internal_error. Call-site migration will follow in subsequent commits.	2026-05-11 11:58:14 +03:00
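The infrastructure described above (an atomic per-process flag, a getter/setter pair, and a throw helper that aborts when the flag is set) might be sketched as follows. The function names follow the commit message; the exception type, the logging to `std::cerr`, and the memory ordering are assumptions, not the actual ScyllaDB implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

// Stand-in for the real exception type (assumption for this sketch).
struct malformed_sstable_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

namespace {
// Per-process flag; defaults to false so existing behaviour is preserved.
std::atomic<bool> abort_flag{false};
}

void set_abort_on_malformed_sstable_error(bool value) noexcept {
    abort_flag.store(value, std::memory_order_relaxed);
}

bool abort_on_malformed_sstable_error() noexcept {
    return abort_flag.load(std::memory_order_relaxed);
}

// Either aborts (leaving a coredump with the original stack intact) or
// throws, depending on the flag. It never returns normally.
[[noreturn]] void throw_malformed_sstable_exception(const std::string& msg) {
    if (abort_on_malformed_sstable_error()) {
        std::cerr << "malformed sstable: " << msg << ", aborting\n";
        std::abort();
    }
    throw malformed_sstable_exception(msg);
}
```

Routing every throw site through one helper is what lets a single flag cover all parse-error paths; aborting before any exception is thrown keeps the stack frames that point at the corruption site.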
Botond Dénes	c3daa6379c	sstables: refactor parse_path() to return std::expected<> instead of throwing make_entry_descriptor() and the two overloads of parse_path() used to signal parse failures by throwing malformed_sstable_exception, which made parse_path() expensive to use as a probe (e.g. to classify directory entries). Change make_entry_descriptor() and both parse_path() overloads to return std::expected<T, sstring>, where the sstring carries the error message on failure, eliminating the exception overhead at probe call sites. Call sites that previously caught malformed_sstable_exception to treat the path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc, tools/scylla-sstable.cc) now check the expected result directly. Call sites where a parse failure is a genuine error (sstable_directory.cc, sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw explicitly as malformed_sstable_exception using the error string, preserving the existing error propagation behaviour.	2026-05-11 11:58:14 +03:00
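The idea of returning a parse error as a value, so that probing call sites pay no exception cost, can be sketched as below. To keep the example compilable as C++17 it uses `std::variant<T, std::string>` as a stand-in for `std::expected<T, sstring>`; `entry_descriptor`'s fields, `parse_name()` and the simplified `version-generation-format-component` naming scheme are assumptions for illustration, not the real `parse_path()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical, simplified descriptor of an sstable component file name.
struct entry_descriptor {
    std::string version;
    std::string generation;
    std::string format;
    std::string component;
};

// Stand-in for std::expected<entry_descriptor, sstring>:
// either the parsed descriptor or an error message.
using parse_result = std::variant<entry_descriptor, std::string>;

parse_result parse_name(const std::string& name) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    // Split on the first three '-' characters; the component keeps the rest.
    for (int i = 0; i < 3; ++i) {
        const auto pos = name.find('-', start);
        if (pos == std::string::npos || pos == start) {
            return std::string("not an sstable component name: ") + name;
        }
        parts.push_back(name.substr(start, pos - start));
        start = pos + 1;
    }
    std::string component = name.substr(start);
    if (component.empty()) {
        return std::string("missing component in: ") + name;
    }
    return entry_descriptor{parts[0], parts[1], parts[2], std::move(component)};
}

// Probe-style caller: classifies a directory entry without any try/catch.
bool is_sstable_component(const std::string& name) {
    return std::holds_alternative<entry_descriptor>(parse_name(name));
}
```

A caller for which a parse failure is a genuine error would instead take the error string out of the result and throw explicitly, which matches the split between probe sites and error sites that the commit describes.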
Taras Veretilnyk	7cdf215999	sstables: make sstable::unlink() idempotent Avoid duplicate work when unlink() is called more than once on the same sstable. This happens when a caller invokes unlink() explicitly on an sstable that is also marked for deletion: the destructor's close_files() path would otherwise call unlink() again, re-firing _on_delete, double-counting _stats.on_delete() and double-invoking _manager.on_unlink().	2026-04-21 22:41:02 +02:00
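Idempotence of this kind is commonly achieved with a "done" flag checked at the top of the function. The sketch below assumes a hypothetical `sstable_sketch` class with a single deletion callback; the real `sstable::unlink()` is asynchronous and updates several pieces of state.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical, synchronous model of an sstable with a deletion callback.
class sstable_sketch {
    bool _unlinked = false;
    bool _marked_for_deletion = false;
    std::function<void()> _on_delete;
public:
    explicit sstable_sketch(std::function<void()> on_delete)
        : _on_delete(std::move(on_delete)) {}

    void mark_for_deletion() { _marked_for_deletion = true; }

    // Safe to call more than once: only the first call does the work,
    // so callbacks and statistics are not double-counted.
    void unlink() {
        if (_unlinked) {
            return;
        }
        _unlinked = true;
        if (_on_delete) {
            _on_delete();
        }
    }

    // Models the destructor path that unlinks sstables marked for deletion.
    ~sstable_sketch() {
        if (_marked_for_deletion) {
            unlink();
        }
    }
};
```

An explicit `unlink()` followed by destruction of a marked sstable then fires the callback exactly once.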
Botond Dénes	69c58c6589	Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage. The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it. This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901 The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions. Closes scylladb/scylladb#28873 * github.com:scylladb/scylladb: streaming: reject mutation fragments on critical disk utilization test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection sstables: clean up TemporaryHashes file in wipe() sstables: add error injection point in write_components test/cluster/storage: extract validate_data_existence to module scope test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity utils/disk_space_monitor: add error injection to suppress threshold checks	2026-04-20 17:56:36 +03:00
Botond Dénes	57f8be49e9	Merge 'Move ignore_component_digest_mismatch flag on sstables_manager' from Pavel Emelyanov The PR serves two purposes. First, it makes the flag usage consistent across the multiple ways of loading sstable components. For example, sstable::load_metadata() doesn't set it (like .load() does), thus potentially refusing to load "corrupted" components that the flag is meant to let through. Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This getter is called pretty much everywhere to initialize the sstable_open_config, while the option in question is a "scylla state" parameter, not an "sstable opening" one. Code cleanup, not backporting Closes scylladb/scylladb#29513 * github.com:scylladb/scylladb: sstables: Remove ignore_component_digest_mismatch from sstable_open_config sstables: Move ignore_component_digest_mismatch initialization to constructor sstables: Add ignore_component_digest_mismatch to sstables_manager config	2026-04-17 12:54:17 +03:00
Łukasz Paszkowski	4657d9e32c	streaming: reject mutation fragments on critical disk utilization The stream_mutation_fragments RPC handler did not check is_in_critical_disk_utilization_mode before accepting incoming mutation fragments. This meant load-and-stream (nodetool refresh --load-and-stream) could push data onto a node at critical disk utilization, potentially filling the disk completely. Add a critical disk utilization check in the get_next_mutation_fragment lambda, throwing critical_disk_utilization_exception when the node is in critical mode. This mirrors the existing protection in stream_blob.cc. Also remove the xfail marker from the corresponding test added in the previous commit.	2026-04-17 09:31:26 +02:00
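The shape of the fix, a check performed each time the next fragment is requested, might be sketched like this. `make_guarded_source()`, the `fragment` alias and the callback-based disk check are assumptions for illustration; the real handler is asynchronous Seastar code and throws `replica::critical_disk_utilization_exception`.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in for replica::critical_disk_utilization_exception.
struct critical_disk_utilization_exception : std::runtime_error {
    critical_disk_utilization_exception()
        : std::runtime_error("critical disk utilization") {}
};

using fragment = std::string;  // stand-in for a mutation fragment

// Wraps a fragment source so that every pull first checks disk state.
// Checking per fragment (not once per stream) also rejects streams that
// were accepted before the node crossed the critical threshold.
std::function<std::optional<fragment>()>
make_guarded_source(std::function<std::optional<fragment>()> next,
                    std::function<bool()> is_in_critical_mode) {
    return [next = std::move(next),
            is_in_critical_mode = std::move(is_in_critical_mode)]()
               -> std::optional<fragment> {
        if (is_in_critical_mode()) {
            throw critical_disk_utilization_exception();
        }
        return next();
    };
}
```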
Pavel Emelyanov	9107e055b3	sstables: Move ignore_component_digest_mismatch initialization to constructor Initialize the ignore_component_digest_mismatch flag from sstables_manager::config in the sstable constructor initializer list instead of in load(). This ensures the flag value is set at construction time when the manager config is available, rather than at load time. Mark the member const to reflect its immutability after construction. Fixes the bootstrap path which now correctly reads the flag from manager config initialized from db::config at boot time, instead of using the default value. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:49:00 +03:00
Łukasz Paszkowski	159675e975	sstables: add error injection point in write_components Add a `write_components_writer_created` error injection point in `sstable::write_components()` between writer creation and fragment consumption. This injection is needed by the out-of-space streaming test (added in the next patch) to reliably pause SSTable writing at the right moment: after the SSTable writer has been created and files exist on disk, but before mutation fragments are consumed. Pausing earlier (before writer creation) would not work because there are no files on disk yet, while pausing later (after consuming fragments) would be too late to reliably push the node into critical disk utilization.	2026-04-16 08:38:34 +02:00
Benny Halevy	d92cd42fe6	sstables: add LargeDataRecords metadata type (tag 13) Add a new scylla metadata component LargeDataRecords (tag 13) that stores per-SSTable top-N large data records. Each record carries: - large_data_type (partition_size, row_size, cell_size, etc.) - binary serialized partition key and clustering key - column name (for cell records) - value (size in bytes) - element count (rows or collection elements, type-dependent) - range tombstones and dead rows (partition records only) The struct uses disk_string<uint32_t> for key/name fields and is serialized via the existing describe_type framework into the SSTable Scylla metadata component. Add JSON support in scylla-sstable and format documentation.	2026-04-16 08:49:01 +03:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Taras Veretilnyk	c123f637ea	sstables: add option to ignore component digest mismatches Add `ignore_component_digest_mismatch` option to `sstable_open_config` that logs a warning instead of throwing `malformed_sstable_exception` on component digest mismatch. This is useful for recovering sstables with corrupted non-vital components or working around bugs in digest calculation. Expose the option in scylla-sstable via the `--ignore-component-digest-mismatch` flag for the upgrade operation.	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	e78a3d2c44	sstables: validate index components digests during SSTable scrub in validate mode	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	9decbdeab0	sstables: verify component digests on SSTable load Add integrity verification for SSTable component files by validating their CRC32 digests against the expected values stored in Scylla metadata during SSTable loading. The following components are validated on load: TOC, Scylla metadata, CompressionInfo, Statistics, Summary, and Filter.	2026-03-10 19:24:05 +01:00
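Verifying a component against a stored CRC32 digest amounts to recomputing the checksum and comparing. The sketch below assumes the standard CRC-32 (the zlib/IEEE variant) and uses hypothetical helpers `crc32_of()` and `verify_component_digest()`; the commit message does not specify which CRC32 variant or API ScyllaDB uses, and the real code reports `malformed_sstable_exception`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Slow but dependency-free; production code would use a table or hardware.
std::uint32_t crc32_of(std::string_view data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial if the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Throws if the component's contents do not match the expected digest.
void verify_component_digest(std::string_view component,
                             std::string_view contents,
                             std::uint32_t expected) {
    if (crc32_of(contents) != expected) {
        throw std::runtime_error("digest mismatch for component " +
                                 std::string(component));
    }
}
```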
Botond Dénes	81e214237f	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculating and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. Several test cases were introduced to verify expected behaviour. Additionally, this PR adds a new rewrite component mechanism for safe sstable component rewriting. Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup. However, with component digests stored in scylla_metadata (#20100), replacing a component like Statistics requires atomically updating both the component and scylla_metadata with the new digest - impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla_metadata - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component along with updated scylla_metadata containing the new digest - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, the sstable will be automatically deleted on node startup due to temporary toc persistence. 
Backport is not required, it is a new feature Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453 Closes scylladb/scylladb#28338 * github.com:scylladb/scylladb: docs: document components_digests subcomponent and trailing digest in Scylla.db sstable_compaction_test: Add tests for perform_component_rewrite sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: replace rewrite_statistics with new rewrite component mechanism sstables: add new rewrite component mechanism for safe sstable component rewriting compaction: add compaction_group_view method to specify sstable version sstables: add null_data_sink and serialized_checksum for checksum-only calculation sstables: extract default write open flags into a constant sstables: Add write_simple_with_digest for component checksumming sstables: Extract file writer closing logic into separate methods sstables: Implement CRC32 digest-only writer	2026-03-10 16:02:53 +02:00
Taras Veretilnyk	54af4a26ca	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in scylla metadata component. This also extends new rewrite component mechanism, to rewrite metadata with updated digest together with the component.	2026-03-06 21:58:10 +01:00
Botond Dénes	04b001daa6	sstable: move away from tombstone_gc_mode::operator bool() It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead. Note that the two have slightly different meanings, operator bool() returned true when repair-history related functionality was enabled. This is fine, because the only two users are logs, where the two meanings are close enough. All other users were eliminated or migrated already, taking the change in meaning into account.	2026-03-03 14:09:28 +02:00
Taras Veretilnyk	5bbc44ed12	sstables: replace rewrite_statistics with new rewrite component mechanism This commit migrates all callers that used rewrite_statistics to the new rewrite component mechanism.	2026-02-26 22:38:55 +01:00
Taras Veretilnyk	51c345aaf6	sstables: add new rewrite component mechanism for safe sstable component rewriting Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup. However, this approach won't work once component digests are stored in scylla_metadata, as replacing a component like Statistics will require atomically updating both the component and scylla_metadata with the new digest—impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it. - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to temporary toc persistence. This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.	2026-02-26 22:38:55 +01:00
Taras Veretilnyk	16ea7a8c1c	sstables: add null_data_sink and serialized_checksum for checksum-only calculation Introduce a null_data_sink and make_digest_calculator implementation that discards all writes, enabling checksum calculation without file I/O. This allows the existing checksummed_file_writer to be used for digest computation without writing data to disk. This will be used in a future commit to calculate the scylla metadata component checksum before writing it to disk, allowing the component to store its own checksum.	2026-02-26 22:38:51 +01:00
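Checksum-only calculation through a discarding sink can be sketched as a writer that updates a running CRC32 and forwards the bytes to whatever sink it was given. `null_sink`, `string_sink` and `digest_writer` are hypothetical names for this sketch, and the standard zlib-style CRC-32 is assumed; the real code builds on Seastar's `data_sink` and the existing `checksummed_file_writer`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Sink that discards everything: no buffering, no file I/O.
struct null_sink {
    void put(std::string_view) {}
};

// Sink that keeps the bytes, standing in for a real file.
struct string_sink {
    std::string out;
    void put(std::string_view data) { out.append(data); }
};

// Writer that maintains a running CRC-32 over everything written,
// independent of what the sink does with the bytes.
template <typename Sink>
class digest_writer {
    Sink _sink;
    std::uint32_t _crc = 0xFFFFFFFFu;
public:
    explicit digest_writer(Sink sink = Sink{}) : _sink(std::move(sink)) {}

    void write(std::string_view data) {
        for (unsigned char byte : data) {
            _crc ^= byte;
            for (int bit = 0; bit < 8; ++bit) {
                _crc = (_crc >> 1) ^ (0xEDB88320u & (0u - (_crc & 1u)));
            }
        }
        _sink.put(data);
    }

    std::uint32_t digest() const { return _crc ^ 0xFFFFFFFFu; }
    const Sink& sink() const { return _sink; }
};
```

Because the digest depends only on the bytes written, a dry run through `digest_writer<null_sink>` yields the same value as the later real write, which is what would allow a component to embed its own checksum before it reaches disk.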
Taras Veretilnyk	f140ab0332	sstables: extract default write open flags into a constant Extract the commonly used `open_flags::wo \| open_flags::create \| open_flags::exclusive` into a reusable constant `sstable_write_open_flags` to reduce duplication.	2026-02-13 14:27:01 +01:00
Taras Veretilnyk	c8281b7b8b	sstables: Add write_simple_with_digest for component checksumming Introduce new methods to write SSTable components while calculating and returning their CRC32 checksums. This adds: - make_digests_component_file_writer(): creates a crc32_digest_file_writer for component writing with checksum tracking - write_simple_with_digest() and do_write_simple_with_digest(): write components and return the full checksum value	2026-02-13 14:27:01 +01:00
Pavel Emelyanov	4a307d931a	sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink The former accumulates sstable writer writes into a vector of temporary buffers. In seastar there's a generic memory data sink that provides a sink to accumulate stream of bytes into any container. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-23 14:22:22 +03:00
Pavel Emelyanov	97b1340a68	sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl The latter is used to wrap vector of buffers into an input_stream. Seastar already provides the very same functionality with the convenience as_input_stream() helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-23 14:21:14 +03:00
Pavel Emelyanov	32cf358f44	sstable: Simplify storage::snapshot() Now there are only two callers left -- sstable::snapshot() and sstable::seal() that wants to auto-backup the sealed sstable. The snapshot arguments are: - relative path, use _base_dir - no new generation provided - no leave-unsealed tag With that, the implementation of filesystem_storage::snapshot() is as simple as - prepare full path relative to _base_dir - touch new directory - call create_links_common() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-26 09:47:27 +03:00
Pavel Emelyanov	8e496a2f2f	sstables: Introduce storage::clone() And call it from sstable::clone() instead of storage::snapshot(). The snapshot arguments are: - target directory is storage::prefix(), that's _dir itself - new generation is always provided, no need for optional - leave_unsealed bool flag With that, the implementation of filesystem_storage::clone() is as simple as call create_links_common() forwarding args and _dir to it. The unification of leave_unsealed branches will come a bit later making this code even shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-26 09:47:27 +03:00
Botond Dénes	3071ccd54a	Merge 'Storage-agnostic table::snapshot_on_all_shards()' from Pavel Emelyanov The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself. Closes scylladb/scylladb#27762 * github.com:scylladb/scylladb: table: Move snapshot_file_set to table.cc table: Rename and move snapshot_on_all_shards() method table: Ditch jsondir variable table, sstables: Pass snapshot name to sstable::snapshot() table: Use snapshot_writer in write_manifest() table: Use snapshot_writer in write_schema_as_cql() table: Add snapshot_writer::sync() table: Add snapshot_writer::init() table: Introduce snapshot_writer table: Move final sync and rename seal_snapshot() table: Hide write_schema_as_cql() table: Hide table::seal_snapshot() table: Open-code finalize_snapshot() table: Fix indentation after previuous patch table: Use smp::invoke_on_all() to populate the vector with filenames table: Don't touch dir once more on seal_snapshot() table: Open-code table::take_snapshot() into caller lambda table: Move parts of table::take_snapshot to sstables_manager table: Introduce table::take_snapshot() table: Store the result of smp::submit_to in local variable	2025-12-24 13:46:47 +02:00
Pavel Emelyanov	a21aa5bdf6	table, sstables: Pass snapshot name to sstable::snapshot() Currently sstable::snapshot() is called with the name of the directory to put snapshots into. This patch changes it to accept the snapshot name instead. This makes the table-sstable API unaware of the snapshot destination storage type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:14:36 +03:00
Botond Dénes	bfdd4f7776	Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho Split prepare can run concurrently with repair. Consider this: 1) split prepare starts 2) incremental repair starts 3) split prepare finishes 4) incremental repair produces unsplit sstable 5) split is not happening on sstable produced by repair 5.1) that sstable is not marked as repaired yet 5.2) might belong to repairing set (has compaction disabled) 6) split executes 7) repairing or repaired set has unsplit sstable If split was acked to coordinator (meaning prepare phase finished), repair must make sure that all sstables produced by it are split. It's not happening today with incremental repair because it disables split on sstables belonging to repairing group. And there's a window where sstables produced by repair belong to that group. To solve the problem, we want the invariant where all sealed sstables will be split. To achieve this, streaming consumers are patched to produce unsealed sstable, and the new variant add_new_sstable_and_update_cache() will take care of splitting the sstable while it's unsealed. If no split is needed, the new sstable will be sealed and attached. This solution was also needed to interact nicely with out of space prevention too. If disk usage is critical, split must not happen on restart, and the invariant aforementioned allows for it, since any unsplit sstable left unsealed will be discarded on restart. The streaming consumer will fail if disk usage is critical too. The reason interposer consumer doesn't fully solve the problem is because incremental repair can start before split, and the sstable being produced when split decision was emitted must be split before attached. So we need a solution which covers both scenarios. Fixes #26041. Fixes #27414. 
Should be backported to 2025.4 that contains incremental repair Closes scylladb/scylladb#26528 * github.com:scylladb/scylladb: test: Add reproducer for split vs intra-node migration race test: Verify split failure on behalf of repair during critical disk utilization test: boost: Add failure_when_adding_new_sstable_test test: Add reproducer for split vs incremental repair race condition compaction: Fail split of new sstable if manager is disabled replica: Don't split in do_add_sstable_and_update_cache() streaming: Leave sstables unsealed until attached to the table replica: Wire add_new_sstables_and_update_cache() into intra-node streaming replica: Wire add_new_sstable_and_update_cache() into file streaming consumer replica: Wire add_new_sstable_and_update_cache() into streaming consumer replica: Document old add_sstable_and_update_cache() variants replica: Introduce add_new_sstables_and_update_cache() replica: Introduce add_new_sstable_and_update_cache() replica: Account for sstables being added before ACKing split replica: Remove repair read lock from maybe_split_new_sstable() compaction: Preserve state of input sstable in maybe_split_new_sstable() Rename maybe_split_sstable() to maybe_split_new_sstable() sstables: Allow storage::snapshot() to leave destination sstable unsealed sstables: Add option to leave sstable unsealed in the stream sink test: Verify unsealed sstable can be compacted sstables: Allow unsealed sstable to be loaded sstables: Restore sstable_writer_config::leave_unsealed	2025-12-23 07:28:56 +02:00
Benny Halevy	9e18cfbe17	sstable: add _mutate_sem to serialize link/move with components rewrite We currently have races, like between moving an sstable from staging using change_state, or when taking a snapshot, to e.g. rewrite_statistics that replaces one of the sstable component files when called, for example, from update_repaired_at by incremental repair. Use a semaphore as a mutex to serialize those functions. Note that there is no need for rwlock since the operations are rare and read-only operations like snapshot don't need to run in parallel. Fixes #25919 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 17:06:45 +02:00
Botond Dénes	85f05fbe1b	Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk" This reverts commit `866c96f536`, reversing changes made to `367633270a`. This change caused all longevities to fail, with a crash in parsing scylla-metadata. The investigation is still ongoing, with no quick fix in sight yet. Fixes: #27496 Closes scylladb/scylladb#27518	2025-12-16 11:34:40 +02:00
Raphael S. Carvalho	1a077a80f1	sstables: Allow storage::snapshot() to leave destination sstable unsealed Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	c5e840e460	sstables: Add option to leave sstable unsealed in the stream sink That will be needed for file streaming to leave the output sstable unsealed. We want the invariant where all sealed sstables are split after split was ACKed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	ab82428228	sstables: Allow unsealed sstable to be loaded File streaming will have to load an unsealed sstable, so we need to be able to parse components from temporary TOC instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
copilot-swe-agent[bot]	77ee7f3417	Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" This reverts commit `8192f45e84`. The merge exposed a bug where truncate (via drop) fails and causes Raft errors, leading to schema inconsistencies across nodes. This results in test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist' errors. The specific problematic change was in commit `19b6207f` which modified truncate_table_on_all_shards to set use_sstable_identifier = true. This causes exceptions during truncate that are not properly handled, leading to Raft applier fiber stopping and nodes losing schema synchronization.	2025-12-12 03:55:13 +00:00
Tomasz Grabiec	7df610b73d	sstables: Remove host id mismatch warning for sstable streaming Tablet migration transfers sstable files without changing origin host-id. As it should, because those sstables were not written on the destination host, and should be ignored by commit log replay. So it's a normal situation, and it's confusing to see this warning in logs. Fixes #26957 Closes scylladb/scylladb#27433	2025-12-08 18:39:22 +02:00
Pavel Emelyanov	8192f45e84	Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy This change adds a new option to the REST API and correspondingly, to scylla nodetool: use_sstable_identifier. When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory and the manifest.json file, rather than using the sstable generation. This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable may be migrated across shards or across nodes, and in this case, its generation may change, but its sstable identifier remains stable. Currently, Scylla Manager uses the sstable generation to detect sstables that are already backed up to object storage and exist in previous backed up snapshots. Historically, the sstable generation was guaranteed to be unique only per table per node, so the dedup code currently checks for deduplication in the node scope. However, with tablet migration, sstables are renamed when migrated to a different shard, i.e. their generation changes, and they may be renamed when migrated to another node, but even if they are not, the dedup logic still assumes uniqueness only within a node. To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since `3a12ad96c7`). Given the globally unique sstable identifier, Scylla Manager can now detect duplicate sstables in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication or dc-wide, as tablets are migrated across racks only on rare occasions (like when converting from a numerical replication factor to a rack list containing a subset of the available racks in a datacenter). 
Fixes #27181 * New feature, no backport required Closes scylladb/scylladb#27184 * github.com:scylladb/scylladb: database: truncate_table_on_all_shards: set use_sstable_identifier to true nodetool: snapshot: add --use-sstable-identifier option api: storage_service: take_snapshot: add use_sstable_identifier option test: database_test: add snapshot_use_sstable_identifier_works test: database_test: snapshot_works: add validate_manifest sstable: write_scylla_metadata: add random_sstable_identifier error injection table: snapshot_on_all_shards: take snapshot_options sstable: add get_format getter sstable: snapshot: add use_sstable_identifier option db: snapshot_ctl: snapshot_options: add use_sstable_identifier options db: snapshot_ctl: move skip_flush to struct snapshot_options	2025-12-08 12:56:12 +03:00
Taras Veretilnyk	bc2e83bc1f	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.	2025-12-04 21:00:09 +01:00
Benny Halevy	07b92a1ee8	test: database_test: add snapshot_use_sstable_identifier_works Test that taking a snapshot with the use_sstable_identifier option (and injecting `random_sstable_identifier`) produces different file names in the snapshot than the original sstable names, and validate the manifest.json file accordingly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:57:38 +02:00
Benny Halevy	28cb300d0a	sstable: write_scylla_metadata: add random_sstable_identifier error injection To be used by a unit test in the following patch for testing the snapshot use_sstable_identifier option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:55:50 +02:00
Benny Halevy	7c62417b54	sstable: snapshot: add use_sstable_identifier option When set to true, use the sstable_identifier as the sstable name in the snapshot rather than its generation. sstable::snapshot now returns the generation it used for the sstable in the snapshot, based on the `use_sstable_identifier` option, to be used by the upper layer generating the manifest. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:53:32 +02:00
Botond Dénes	296d7b8595	Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest. New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches. Backport is not required, since it is a new feature Fixes #21776 Closes scylladb/scylladb#26702 * github.com:scylladb/scylladb: file_stream_test: add sstable file streaming integrity verification test cases streaming: prioritize sender-side errors in tablet_stream_files sstables: enable integrity check for data file streaming sstables: Add compressed raw streaming support sstables: Allow to read digest and checksum from user provided file instance sstables: add overload of data_stream() to accept custom file_input_stream_options	2025-11-24 06:37:27 +02:00
Taras Veretilnyk	c8d2f89de7	sstables: enable integrity check for data file streaming This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest.	2025-11-21 12:52:26 +01:00
Taras Veretilnyk	18e1dbd42e	sstables: Add compressed raw streaming support Implement compressed_raw_file_data_source that streams compressed chunks without decompression while verifying checksums and calculating digests. Extends raw_stream enum to support compressed_chunks mode. This data_source implementation will be used in the next commits for file based streaming.	2025-11-21 12:52:04 +01:00
Taras Veretilnyk	c32e9e1b54	sstables: Allow to read digest and checksum from user provided file instance Add overloaded methods to read digest and checksum from user-provided file handles: - 'read_digest(file f)' - 'read_checksum(file f)' This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.	2025-11-21 12:51:40 +01:00

1418 Commits