scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 12:06:44 +00:00

Author	SHA1	Message	Date
Botond Dénes	555cfbcd38	Merge 'treewide: replace deprecated smp::count and smp::all_cpus() with new APIs' from Avi Kivity Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched). Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads. Notable cases: - dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable. - service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads. - schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases. Not changed: - scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context). - Python test files: only reference smp::count in comments/strings. No backport: the Seastar commit that deprecated these functions hasn't (and won't) make its way into any release branches (and the warnings are cosmetic anyway) Closes scylladb/scylladb#29990 * github.com:scylladb/scylladb: treewide: replace deprecated smp::count and smp::all_cpus() with new APIs scylla-gdb: read shard count from smp::_this_smp instead of smp::count schema_builder: make shard_count an explicit constructor parameter	2026-05-27 09:42:06 +03:00
Avi Kivity	8010e408a2	treewide: replace deprecated smp::count and smp::all_cpus() with new APIs Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched). Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads. Notable cases: - dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable. - service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads. Not changed: - scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context). - Python test files: only reference smp::count in comments/strings.	2026-05-26 17:35:20 +03:00
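The note about dht/token-sharding.hh relies on a general C++ rule: a default argument is an expression that is evaluated at each call that omits the parameter. The sketch below illustrates only that rule. `fake_this_smp_shard_count()`, `g_current_shard_count` and `shard_of()` are hypothetical stand-ins, not Seastar or ScyllaDB names, and the reactor-thread requirement is not modeled.

```cpp
#include <cassert>

// Hypothetical stand-in for a "current shard count" query. The real
// this_smp_shard_count() needs a reactor thread; this sketch does not.
static unsigned g_current_shard_count = 4;

unsigned fake_this_smp_shard_count() {
    return g_current_shard_count;
}

// The default argument is an expression, so it is evaluated anew at every
// call that omits shard_count, not once when the function is declared.
unsigned shard_of(unsigned token, unsigned shard_count = fake_this_smp_shard_count()) {
    assert(shard_count > 0);
    return token % shard_count;
}
```

Calling `shard_of(10)` before and after changing `g_current_shard_count` gives different results, while `shard_of(10, 1)` bypasses the default entirely, which mirrors passing 1 explicitly in tests that have no reactor.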
Botond Dénes	0fd25dc47c	Merge 'Replace get_injection_parameters() with inject_parameter() where appropriate' from Pavel Emelyanov Several error injection sites use the low-level get_injection_parameters() API to fetch the entire parameters map and then manually look up a single key. The inject_parameter() API is better suited for these cases — it combines the enabled check and typed single-parameter extraction in one call, returning std::optional. Cleaning error injection usage, not backporting Closes scylladb/scylladb#29970 * github.com:scylladb/scylladb: test: Use inject_parameter() in row_cache_test sstables: Use inject_parameter() for mx reader fill buffer timeout streaming: Use inject_parameter() for order_sstables_for_streaming	2026-05-26 10:32:44 +03:00
Yaniv Michael Kaul	acd3115645	sstables: include SSTable filename in Stats metadata error messages When Stats metadata is not available or malformed, include the SSTable filename in the error message to help operators identify which SSTable files need attention during startup failures. Fixes: https://github.com/scylladb/scylla-enterprise/issues/5439 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: yes Backport: no, benign improvement Closes scylladb/scylladb#29950	2026-05-22 16:49:37 +03:00
Avi Kivity	305346a3ec	Merge 'Don't materialize collections into intermediate representations' from Botond Dénes Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again. This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections. This PR aims to solve this problem once and for all. The plan is as follows: * Promote direct use of the serialized collection format: - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`. - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`). * Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure. * Drop the old infrastructure, to avoid accidental regressions. Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion. 
To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure

Command:
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```

| Metric | Before | After | Change |
|--------------------------|--------:|--------:|------------|
| Throughput (median tps) | 315,760 | 332,021 | +5.1% |
| Instructions/op (median) | 53,776 | 48,681 | -9.5% |
| CPU cycles/op (median) | 17,365 | 16,471 | -5.1% |
| Allocations/op | 85.1 | 82.1 | -3.5% |

Significant improvement. Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.

---

Command:
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```

| Metric | Before | After | Change |
|--------------------------|----------:|---------:|-----------|
| Throughput (median tps) | 150,823 | 149,678 | -0.8% |
| Instructions/op (median) | 108,388 | 103,858 | -4.2% |
| CPU cycles/op (median) | 34,860 | 35,371 | +1.5% |
| Allocations/op | ~105–108 | ~102–103 | -3.0% |

Mixed, mostly neutral. Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.

---

Command:
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```

| Metric | Before | After | Change |
|--------------------------|--------:|--------:|-----------|
| Throughput (median tps) | 55,777 | 56,051 | +0.5% |
| Instructions/op (median) | 246,215 | 246,610 | +0.2% |
| CPU cycles/op (median) | 77,641 | 77,020 | -0.8% |
| Allocations/op | 340.4 | 335.4 | -1.5% |

Essentially neutral. All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.

---

The change has a clear, substantial positive effect on reads (~5% throughput gain, ~9.5% fewer instructions per op). The write and alternator paths are unaffected in practice: changes there are within measurement noise. No regressions are apparent. This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.

Fixes: #3602

Improvement, no backport.
Closes scylladb/scylladb#29127 * github.com:scylladb/scylladb: mutation/collection_mutation: make collection_mutation::_data private mutation_collection: drop collection_mutation_description and friends test: move away from collection_mutation_description tree: move away from collection_mutation_description test: move away from collection_mutation_view::with_deserialized() tree: move away from collection_mutation_view::with_deserialized() types: fix indendation, left broken by previous commit types: move away from collection_mutation_view::with_deserialized() types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT() schema: column_computation: move away from collection_mutation_view::with_deserialized() mutation: move away from collection_mutation_view::with_deserialized() alternator: move away from collection_mutation_view::with_deserialized() cdc: move away from collection_mutation_view::with_deserialized() mutation/collection_mutation: printer: don't deserialize collections mutation/collection_mutation: difference(): don't deserialize collections mutation/collection_mutation: merge(): don't deserialize collections mutation/collection_mutation: extract compact_and_expire() to free function mutation/collection_mutation: refactor empty(), is_any_live() and last_update() compaction_garbage_collector: pass collection_mutation to collect() test/boost/mutation_test: add tests for collection_mutation_{view,writer} mutation/collaction_mutation: collection_mutation_view: add methods to inspect content mutation/collection_mutation: add collection_mutation_writer mutation/collection_mutation: collection_mutation(): generate valid collection mutation/collection_mutation: collection_mutation(): remove unused abstract_type param mutation/atomic_cell: drop unused type param from from_bytes()	2026-05-21 17:10:40 +03:00
Pavel Emelyanov	4b13b24695	Merge 's3: make S3 connection pool size configurable per scheduling group' from Ernest Zaslavsky The S3 client creates a separate HTTP connection pool per scheduling group. Previously, the pool size was hardcoded as shares/100, yielding 1-10 connections. This was not tunable and could under-provision connections for groups with low share counts. Changes - A missing include (short_streams.hh) in sstables_loader.cc is added first to fix CMake builds where the header is not transitively included. - The hardcoded per-share divisor is replaced with a per-shard connection budget. The new `object_storage_connections_per_shard` config option (default 128) specifies the total number of connections available on each shard. Connections are distributed proportionally across scheduling groups based on their shares: `max_connections = budget * group_shares / total_shares`. Remainder connections are assigned to the group with the most shares. When a new scheduling group client is created, all existing groups are rebalanced via `set_maximum_connections`. Creation and rebalance are serialized with a semaphore to prevent concurrent rebalances from racing. - The config option is made live-updateable: a `storage_manager` observer propagates changes to all existing S3 clients, triggering rebalance under the same semaphore. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1704 No backport needed since this change affects KS on object storage which is not operational yet. Closes scylladb/scylladb#29719 * github.com:scylladb/scylladb: s3: make connections_per_shard live-updateable s3: distribute connection pool proportionally across scheduling groups	2026-05-21 12:12:36 +03:00
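The proportional split described above (`max_connections = budget * group_shares / total_shares`, with remainder connections going to the group with the most shares) can be sketched as follows. This is a minimal illustration under assumptions, not the S3 client's code: `distribute_connections()` and its vector-based interface are hypothetical, and the semaphore-guarded rebalance is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits a per-shard connection budget across scheduling groups in
// proportion to their shares. Integer division rounds down, so leftover
// connections are handed to the group with the most shares.
// Hypothetical helper for illustration; not ScyllaDB's implementation.
std::vector<unsigned> distribute_connections(unsigned budget, const std::vector<unsigned>& shares) {
    std::vector<unsigned> result(shares.size(), 0);
    unsigned long long total = 0;
    for (unsigned s : shares) {
        total += s;
    }
    if (shares.empty() || total == 0) {
        return result;  // nothing to distribute to
    }
    unsigned assigned = 0;
    std::size_t largest = 0;
    for (std::size_t i = 0; i < shares.size(); ++i) {
        // Widen before multiplying to avoid overflow for large budgets.
        result[i] = static_cast<unsigned>(
            static_cast<unsigned long long>(budget) * shares[i] / total);
        assigned += result[i];
        if (shares[i] > shares[largest]) {
            largest = i;
        }
    }
    assert(assigned <= budget);  // floor division never over-assigns
    result[largest] += budget - assigned;
    return result;
}
```

With a budget of 128 and shares of 1000, 200 and 100, the floors are 98, 19 and 9; the two leftover connections go to the first group, giving 100, 19 and 9.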
Andrzej Jackowski	f8156702de	tree: add missing -present to copyright headers ~2076 files used "Copyright (C) YYYY-present ScyllaDB" while ~88 files used "Copyright (C) YYYY ScyllaDB". This inconsistency leads to unnecessary code review discussions and gradual spread of the less common format. Standardize all ScyllaDB copyright headers to use -present. Fixes SCYLLADB-1984 Closes scylladb/scylladb#29876	2026-05-21 10:57:42 +02:00
Botond Dénes	636e2877e2	tree: move away from collection_mutation_description Use collection_mutation_writer instead. Add to_managed_bytes() to cql3::raw_value to help avoid some copies. A special note for sstables/kl/reader.cc: this conversion is not straightforward, so we accumulate a list of cells and feed them to the writer at the end. This is sub-optimal, but this code is rarely used, so it is best to be conservative.	2026-05-21 10:23:29 +03:00
Botond Dénes	35a776d043	tree: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly. This is the remainder after the previous patches collecting larger changes by module.	2026-05-21 10:23:29 +03:00
Botond Dénes	24fdfa34dd	mutation/collection_mutation: collection_mutation(): remove unused abstract_type param	2026-05-21 08:34:21 +03:00
Ernest Zaslavsky	86f678a592	s3: make connections_per_shard live-updateable Wire the object_storage_connections_per_shard config option as LiveUpdate so it can be changed at runtime without restart. When the value changes, the storage_manager observer propagates it to all existing S3 clients, which rebalance their connection pools under the rebalance semaphore.	2026-05-20 20:45:14 +03:00
Ernest Zaslavsky	b9e1dcc0fe	s3: distribute connection pool proportionally across scheduling groups The S3 client creates a separate HTTP connection pool per scheduling group. Previously, the pool size was computed per-group using a per-share multiplier (connections = shares * multiplier), which did not account for the total number of groups sharing the shard's connection budget. Replace the per-share multiplier with a per-shard connection budget: the new object_storage_connections_per_shard config option (default 100) specifies the total number of connections available on each shard. When a new scheduling group's client is created, connections are distributed proportionally across all groups based on their shares (connections = budget * group_shares / total_shares), and existing groups are rebalanced via set_maximum_connections. When the endpoint_config has an explicit max_connections override, it is used directly without proportional distribution.	2026-05-20 20:45:04 +03:00
Avi Kivity	6df04c9e5b	Update seastar submodule Changed seastar::http::experimental to seastar::http to reflect graduation of the seastar http API. Changed call to seastar::rename_file() (in sstables/storage.cc, sstables/sstable_directory.cc, sstables/sstables.cc and db/hints/internal/hint_storage.cc) to reflect new default parameter. Updated scylla_gdb test helper get_task() to work with updated accept loop in Seastar. This is just test code (attempts to find a task to operate on), not used in real scylla-gdb.py work, but nevertheless the adjustment keeps backward compatibility. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043 * seastar 485a62b2...510f3148 (43): > reactor_backend: fix iocb double-free and shutdown hang during AIO teardown > file: fix default DMA alignment > http: add to_reply() to redirect_exception with extra-header support > core: propagate syscall errors via `coroutine::exception` > file: assert dma alignments are powers of two > doc: Document undocumented io_tester features and fix output example > backtrace: print the build_id along with the backtrace > reactor: default to oneline backtraces > Merge 'json: formatter: support types with user-defined conversion to sstring' from Benny Halevy tests: json_formatter: test formatter::write with string types json: formatter: support types with user-defined conversion to sstring > httpd_test: fix build failure with Seastar_SSTRING=OFF > net/tls: introduce ssl_call wrapper for SSL I/O > build: disable unused command line argument error for C++ module > coroutine/generator: fix setup of generator's waiting task > tests/tls: set 1000-day validity for self-signed CA cert > net: tls: openssl: disable certificate compression > reactor: reduce steady_clock::now() calls per scheduling quantum > fair_queue: remove notify_request_finished() > loop: use small_vector for parallel_for_each_state incomplete futures > dodge false sharing in spinlock > 
Merge 'Handle nowait support for reads and writes independently' from Pavel Emelyanov file: Change nowait_works mode detection file: Introduce read-only nowait_mode filesystem: Make nowait_works bit a enum class too file: Make nowait_works bit a enum class > Merge 'net/tls: improve OpenSSL error queue hygiene' from Gellért Peresztegi-Nagy net/tls: assert clean error queue before SSL operations net/tls: clear error queue after successful SSL operations net/tls: clear error queue after successful SSL_CTX_new net/tls: drain error queue on unexpected error codes net/tls: use make_openssl_error for BIO creation failure > vla.hh: add missing includes > Merge 'smp: make smp::count non-static' from Avi Kivity smp: convert all smp::count usages to instance-aware alternatives smp: add per-instance shard_count and this_smp() infrastructure disk_params: document pre-init smp::count access with explicit 0 reactor_backend: document pre-init smp::count access with explicit 0 tests: alien_test: pass shard count to alien thread explicitly > build: fix cmake missing ninja on Ubuntu 26.04 > rpc: Fix uint64 wraparound of expired timeout in send_entry() > Merge 'Generalize some RPC tests' from Pavel Emelyanov tests: Generalize async connection-based scheduling RPC tests tests: Generalize sync connection-based scheduling RPC tests tests: Remove redundant variadic/nonvariadic RPC tuple tests tests: Generalize max timeout RPC tests > net: tls: openssl: Share BIO ptrs across shards > http: fix compilation on clang 22 with c++26 > build: openssl tools needed for test cert generation > reactor: support rename2 > future: fix forwarding of reference types > Merge 'Zero-copy http chunked data sink' from Pavel Emelyanov http: Make chunked data sink zero-copy tests/prometheus_http: Rewrite on top of http::client tests/httpd: Rewrite content_length_limit on top of http::client > tests: Replace ad-hoc http_consumer with production HTTP parser > Merge 'co_return to accept same expressions and types 
as return' from Alexey Bashtanov tests/unit/{coroutines,futures}: strict types on co_return and set_value api: introduce version 10: core/{coroutine,future}: make `co_return` more strict with types core/{coroutine,future}: preparations to fix `co_return` type semantics > Merge 'Perftune.py: add special handling for mlx5 rss queues number calculation' from Vladislav Zolotarov perftune.py: NetPerfTuner: enhance RSS (a.k.a. "Rx") queues accounting for mlx5 devices perftune.py: update docstring of NetPerfTuner.__get_rps_cpus() method perftune.py: add a method that parses and models the output of the 'ethtool -l' command for a given interface > httpd: rewrite do_accepts/do_accept_one as coroutines > file: add mmap support to file > http: Move client code out of experimental namespace > file: add hugetlbfs support to file system detection > tests: Replace test_source_impl with util::as_input_stream > tests: Replace buf_source_impl with util::as_input_stream > Merge 'rpc_tester: expose throuput for rpc tester' from Marcin Szopa rpc_tester: remove unused payload size variable from job_rpc_streaming class rpc_tester: add start time tracking for throughput calculation, print throughput and msg/s for job_rpc rpc_tester: refactor result emission to use dedicated functions for messages and throughput > iostream: cast first argument of `std::min` to `size_t` Closes scylladb/scylladb#29952	2026-05-20 13:47:12 +03:00
Pavel Emelyanov	1fd4fc906b	sstables: Use inject_parameter() for mx reader fill buffer timeout Replace the enter() + get_injection_parameters() + map subscript pattern with inject_parameter() which returns an optional and naturally serves as both the enabled check and parameter extraction. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-19 18:22:53 +03:00
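The benefit described in the inject_parameter() commits is that one call returning an optional replaces an enabled check plus a manual map lookup. The sketch below shows the pattern with a toy registry only. `enabled_injections`, `inject_int_parameter()` and the injection name used in the usage note are hypothetical, and the real error-injection API in ScyllaDB may differ in signature and behaviour.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy registry: injection name -> (parameter name -> raw string value).
// Hypothetical stand-in for an error-injection facility.
using params_map = std::map<std::string, std::string>;
static std::map<std::string, params_map> enabled_injections;

// Returns the parameter parsed as an int, or std::nullopt when the
// injection is disabled or the parameter is missing. A single call thus
// serves as both the "is it enabled?" check and the typed extraction.
std::optional<int> inject_int_parameter(const std::string& injection, const std::string& key) {
    assert(!key.empty());
    auto inj = enabled_injections.find(injection);
    if (inj == enabled_injections.end()) {
        return std::nullopt;  // injection not enabled
    }
    auto param = inj->second.find(key);
    if (param == inj->second.end()) {
        return std::nullopt;  // enabled, but this parameter was not supplied
    }
    return std::stoi(param->second);
}
```

A call site can then write `if (auto ms = inject_int_parameter("some_injection", "timeout_ms")) { ... }` and use `*ms` inside the branch, instead of fetching the whole map and subscripting it.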
Pavel Emelyanov	1c0f8ab66e	Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site. This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths: - Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`) - Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`) - `parse_assert()` failures (via `on_parse_error()`) - BTI parse errors (via `on_bti_parse_error()`) The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure. The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption. Commit breakdown: 1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc` 2. `on_parse_error()` and `on_bti_parse_error()` check the new flag 3. All ~50 `throw malformed_sstable_exception(...)` sites migrated 4. 
Both `throw bufsize_mismatch_exception(...)` sites migrated Refs: SCYLLADB-1087 Backport: new feature, no backport Closes scylladb/scylladb#29324 * github.com:scylladb/scylladb: sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose sstables: introduce --abort-on-malformed-sstable-error infrastructure sstables: refactor parse_path() to return std::expected<> instead of throwing	2026-05-12 12:38:25 +03:00
Dimitrios Symonidis	94bc0245f9	sstables, utils/s3: reuse caller-provided file in s3_storage::make_source s3_storage::make_source previously ignored its file f parameter and constructed a fresh s3::client::readable_file per call. The new file's _stats cache was empty, so the first dma_read_bulk issued a HEAD via maybe_update_stats just to learn the object size before the ranged GET -- one ~50 ms RTT per uncached read. The file f passed in by the two callers (sstable::data_stream for Data.db reads and index_reader::make_context for Index.db reads) already wraps the sstable's _data_file or _index_file. Those file objects had their stats populated at sstable open time by update_info_for_opened_data, and they were wrapped with the configured file_io_extensions when opened via open_component. Reusing them is exactly what filesystem_storage::make_source does (one-line make_file_data_source over f), so the s3 path simply matches it. readable_file::size() is also updated to route through maybe_update_stats(), so a .size() call populates the _stats cache the same way .stat() does -- preventing a redundant HEAD on the first subsequent read of components opened with .size() (Index, Partitions, Rows in update_info_for_opened_data). Closes scylladb/scylladb#29766 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 12:38:24 +03:00
Botond Dénes	cf37f541a0	Merge 'sstables_loader: ensure upload directory is empty when load_and_stream returns' from Taras Veretilnyk After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`) returns success, source sstable files in the `upload/` directory may still be on disk. `mark_for_deletion()` only sets an in-memory flag; the actual file deletion runs lazily when the last `shared_sstable` reference drops. This leaves a window between API success and physical deletion in which a follow-up scan of the upload directory can detect sstables that are about to be deleted. This can cause failures, because the sstable may already be wiped while it is being processed. The fix: force the unlink to complete before `stream()` returns, so the upload directory is in a consistent state by the time the API reports success. For tablet streaming, partially-contained sstables participate in multiple per-tablet batches; eagerly unlinking after each batch would break the next batch that still needs to read the file. A `defer_unlinking` flag on the streamer postpones the explicit unlink until after all batches complete (called once at the end of `tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of `stream_sstable_mutations`. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647 Backport is required, as it is a bug fix for a bug that was introduced in `517a4dc4df`. Closes scylladb/scylladb#29599 * github.com:scylladb/scylladb: sstables_loader: synchronously unlink streamed sstables before returning sstables: make sstable::unlink() idempotent	2026-05-11 14:43:46 +03:00
Botond Dénes	ad7ac62835	Merge 'Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562 No need to backport this; keyspace over object storage is an experimental feature Closes scylladb/scylladb#29659 * github.com:scylladb/scylladb: db, sstables: add node_owner to sstables registry primary key db, sstables: rename sstables registry column owner to table_id	2026-05-11 14:08:19 +03:00
Botond Dénes	2edfb91070	sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() Replace the two remaining direct 'throw bufsize_mismatch_exception(...)' call sites with the new throw_bufsize_mismatch_exception() helper, which routes through throw_malformed_sstable_exception() and thus also respects the --abort-on-malformed-sstable-error flag. Affected files: - sstables/sstables.cc (1 site, in check_buf_size()) - sstables/m_format_read_helpers.cc (1 site, in check_buf_size())	2026-05-11 11:58:14 +03:00
Botond Dénes	d65c1523c2	sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() Replace all direct 'throw malformed_sstable_exception(...)' call sites with the new throw_malformed_sstable_exception() helper, which respects the --abort-on-malformed-sstable-error flag.	2026-05-11 11:58:14 +03:00
Botond Dénes	84c27658d9	sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error Both functions now check abort_on_malformed_sstable_error() first. If set, they log the error and call std::abort() directly, generating a coredump. Otherwise they fall through to the existing on_internal_error() path, which is in turn controlled by --abort-on-internal-error.	2026-05-11 11:58:14 +03:00
Botond Dénes	4ebcc002d6	sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after seastar::testing::scoped_no_abort_on_internal_error) and use it in all tests that intentionally corrupt sstables and expect malformed_sstable_exception to be thrown rather than the process aborting.	2026-05-11 11:58:14 +03:00
Botond Dénes	f6dc2cb5f8	sstables: introduce --abort-on-malformed-sstable-error infrastructure Add the --abort-on-malformed-sstable-error command-line option and the supporting infrastructure. When set, any malformed sstable error will abort the process and generate a coredump instead of throwing an exception. This is useful for debugging memory corruption that may manifest as apparent sstable corruption. The implementation introduces: - throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception() helper functions in sstables/sstables.cc, which check the new flag and either abort (with logging) or throw the appropriate exception. - set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error() to control the per-process atomic flag. - abort_on_malformed_sstable_error config option (LiveUpdate, default false) wired up in main.cc alongside abort_on_internal_error. Call-site migration will follow in subsequent commits.	2026-05-11 11:58:14 +03:00
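The abort-or-throw mechanism described in this series can be sketched in a few lines: a per-process atomic flag, a throw helper that consults it, and an RAII guard for tests that corrupt sstables on purpose. This is an illustrative sketch under assumptions. The names `malformed_sstable_error`, `throw_malformed()`, `set_abort_on_malformed()` and `scoped_no_abort` are simplified stand-ins, and the real helpers in sstables/sstables.cc may differ in signature and logging.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

// Simplified stand-in for the sstable parse exception type.
struct malformed_sstable_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Per-process flag; defaults to false so behaviour is unchanged unless enabled.
static std::atomic<bool> abort_on_malformed{false};

void set_abort_on_malformed(bool value) {
    abort_on_malformed.store(value, std::memory_order_relaxed);
}

// Either aborts (keeping the faulting stack intact for a coredump) or
// throws, depending on the flag. Never returns normally.
[[noreturn]] void throw_malformed(const std::string& msg) {
    if (abort_on_malformed.load(std::memory_order_relaxed)) {
        std::cerr << "malformed sstable: " << msg << '\n';
        std::abort();
    }
    throw malformed_sstable_error(msg);
}

// RAII guard for tests that expect the exception: clears the flag for the
// enclosing scope and restores the previous value on exit.
struct scoped_no_abort {
    bool previous = abort_on_malformed.exchange(false);
    ~scoped_no_abort() { abort_on_malformed.store(previous); }
};
```

Routing every throw site through one helper is what lets a single flag cover all code paths; a site that still uses a bare `throw` would silently bypass the abort.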
Botond Dénes	c3daa6379c	sstables: refactor parse_path() to return std::expected<> instead of throwing make_entry_descriptor() and the two overloads of parse_path() used to signal parse failures by throwing malformed_sstable_exception, which made parse_path() expensive to use as a probe (e.g. to classify directory entries). Change make_entry_descriptor() and both parse_path() overloads to return std::expected<T, sstring>, where the sstring carries the error message on failure, eliminating the exception overhead at probe call sites. Call sites that previously caught malformed_sstable_exception to treat the path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc, tools/scylla-sstable.cc) now check the expected result directly. Call sites where a parse failure is a genuine error (sstable_directory.cc, sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw explicitly as malformed_sstable_exception using the error string, preserving the existing error propagation behaviour.	2026-05-11 11:58:14 +03:00
Yaniv Kaul	e29f59347b	sstables: fix missing format placeholders in error messages Fix three format string bugs: - partition_reversing_data_source.cc: _row_start was passed as an argument but had no {} placeholder in the invariant error message. Add {} for all three values to show the full diagnostic. - reader.cc: two "Invalid boundary type" error messages passed the type value as an argument but had no {} placeholder, so the actual invalid type was never shown. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2026-05-10 17:51:19 +03:00
Botond Dénes	d0813769ec	sstables/trie: add preemption points in trie_writer The BTI partition index trie writer flushes all buffered nodes at the end of each SSTable via complete_until_depth(0), called from bti_partition_index_writer_impl::finish(). This is a tight synchronous loop that writes trie nodes through file_writer::write(), which uses a buffered output_stream: individual writes that fit in the buffer are plain memcpy operations returning a ready future, so .get() never yields. As a result the reactor can stall for several milliseconds on large SSTables. The entire call chain runs inside seastar::async() (via sstable::write_components()), so seastar::thread::maybe_yield() is safe to call here. Add it at the top of both tight loops: - complete_until_depth(), which iterates over trie depth - lay_out_children(), which iterates over child branches per node Fixes SCYLLADB-1885 Closes scylladb/scylladb#29798	2026-05-10 11:30:59 +03:00
Łukasz Paszkowski	7e14ea5ac8	sstables: only wipe TemporaryHashes for sstable formats that have it Commit `8d34127684` ("sstables: clean up TemporaryHashes file in wipe()") unconditionally calls filename(..., component_type::TemporaryHashes) inside filesystem_storage::wipe(). However, the TemporaryHashes component is only registered in the component map of the 'ms' sstable format. For older formats (ka, la, mc, md, me) the lookup goes through sstable_version_constants::get_component_map(version).at(...) and throws std::out_of_range. The exception is then swallowed by the outer catch(...) in wipe(), which just logs and ignores. As a side effect, the subsequent remove_file(new_toc_name) is never reached and the TemporaryTOC ('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink() of a non-'ms' sstable. Guard the lookup with get_component_map(version).contains() so the cleanup is only attempted for formats that actually define the component. Add a regression test in test/boost/sstable_directory_test.cc that creates an 'me'-format sstable, unlinks it and asserts that the sstable directory is left empty. Without the fix the test fails with a leftover 'me-...-TOC.txt.tmp' file. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697 Closes scylladb/scylladb#29620	2026-04-29 08:06:36 +03:00
Taras Veretilnyk	784127c40b	sstables_loader: synchronously unlink streamed sstables before returning mark_for_deletion() only set an in-memory flag; the actual file deletion ran lazily when the last shared_sstable reference dropped, leaving a window in which a follow-up scan of the upload directory (e.g. a second 'nodetool refresh --load-and-stream') could observe a partially-deleted sstable and fail with malformed_sstable_exception. Force the unlink to complete before stream() returns. For tablet streaming, partially-contained sstables span multiple per-tablet batches, so a defer_unlinking flag postpones the unlink until after all sstables are streamed; with vnodes, and for fully-contained sstables, each sstable is streamed only once and can be removed right after it is streamed. Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to make the never-fails contract explicit.	2026-04-28 14:52:28 +02:00
Dimitrios Symonidis	c40842f60a	db, sstables: add node_owner to sstables registry primary key Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. The new column is populated via sstables_manager::get_local_host_id(). No backward compatibility is preserved; the feature is experimental and gated by keyspace-storage-options.	2026-04-24 16:41:09 +02:00
Dimitrios Symonidis	ce78c5113e	db, sstables: rename sstables registry column owner to table_id The partition-key column in system.sstables named 'owner' actually holds a table_id. Rename the CQL column and the matching C++ parameter and member names so the identifier describes what it stores. No behavior change. This prepares the schema for an upcoming node_owner partition-key column (the local host id), which needs a free name.	2026-04-24 16:24:07 +02:00
Taras Veretilnyk	7cdf215999	sstables: make sstable::unlink() idempotent Avoid duplicate work when unlink() is called more than once on the same sstable. This happens when a caller invokes unlink() explicitly on an sstable that is also marked for deletion: the destructor's close_files() path would otherwise call unlink() again, re-firing _on_delete, double-counting _stats.on_delete() and double-invoking _manager.on_unlink().	2026-04-21 22:41:02 +02:00
Botond Dénes	cfebe17592	sstables: fix segfault in parse_assert() when message is nullptr parse_assert() accepts an optional `message` parameter that defaults to nullptr. When the assertion fails and message is nullptr, it is implicitly converted to sstring via the sstring(const char*) constructor, which calls strlen(nullptr) -- undefined behavior that manifests as a segfault in __strlen_evex. This turns what should be a graceful malformed_sstable_exception into a fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered parse_assert() during streaming (in continuous_data_consumer::fast_forward_to()), causing a crash loop on the affected node. Fix by guarding the nullptr case with a ternary, passing an empty sstring() when message is null. on_parse_error() already handles the empty-message case by substituting "parse_assert() failed". Fixes: SCYLLADB-1329 Closes scylladb/scylladb#29285	2026-04-21 12:40:33 +02:00
Botond Dénes	69c58c6589	Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage. The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it. This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901 The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions. Closes scylladb/scylladb#28873 * github.com:scylladb/scylladb: streaming: reject mutation fragments on critical disk utilization test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection sstables: clean up TemporaryHashes file in wipe() sstables: add error injection point in write_components test/cluster/storage: extract validate_data_existence to module scope test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity utils/disk_space_monitor: add error injection to suppress threshold checks	2026-04-20 17:56:36 +03:00
Botond Dénes	57f8be49e9	Merge 'Move ignore_component_digest_mismatch flag on sstables_manager' from Pavel Emelyanov The PR serves two purposes. First, it makes the flag's use consistent across the several ways of loading sstable components. For example, sstable::load_metadata() doesn't set it (as .load() does), and so may refuse to load "corrupted" components that the flag is meant to allow. Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This getter is called pretty much everywhere to initialize the sstable_open_config, although the option in question is a "scylla state" parameter, not an "sstable opening" one. Code cleanup, not backporting Closes scylladb/scylladb#29513 * github.com:scylladb/scylladb: sstables: Remove ignore_component_digest_mismatch from sstable_open_config sstables: Move ignore_component_digest_mismatch initialization to constructor sstables: Add ignore_component_digest_mismatch to sstables_manager config	2026-04-17 12:54:17 +03:00
Łukasz Paszkowski	4657d9e32c	streaming: reject mutation fragments on critical disk utilization The stream_mutation_fragments RPC handler did not check is_in_critical_disk_utilization_mode before accepting incoming mutation fragments. This meant load-and-stream (nodetool refresh --load-and-stream) could push data onto a node at critical disk utilization, potentially filling the disk completely. Add a critical disk utilization check in the get_next_mutation_fragment lambda, throwing critical_disk_utilization_exception when the node is in critical mode. This mirrors the existing protection in stream_blob.cc. Also remove the xfail marker from the corresponding test added in the previous commit.	2026-04-17 09:31:26 +02:00
Botond Dénes	88a8324e68	Merge 'db: store large data records in SSTable metadata and serve via virtual tables' from Benny Halevy `system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables. This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades. When the cluster feature is enabled, each node drops the old system large_ tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables. Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten; it is therefore recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records. 1. keys: move key_to_str() to keys/keys.hh — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable 2. sstables: add LargeDataRecords metadata type (tag 13) — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation 3. large_data_handler: rename partition_above_threshold to above_threshold_result — generalize the struct for reuse 4. large_data_handler: return above_threshold_result from maybe_record_large_cells — separate booleans for cell size vs collection elements thresholds 5. sstables: populate LargeDataRecords from writer — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable` 6. test: add LargeDataRecords round-trip unit tests — verify write/read, top-N bounding, below-threshold behavior 7. db: call initialize_virtual_tables from shard 0 only — preparatory refactoring to enable cross-shard coordination 8. db: implement large_data virtual tables with feature flag gating — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276 * Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport Closes scylladb/scylladb#29257 * github.com:scylladb/scylladb: db: implement large_data virtual tables with feature flag gating db: call initialize_virtual_tables from shard 0 only test: add LargeDataRecords round-trip unit tests sstables: populate LargeDataRecords from writer large_data_handler: return above_threshold_result from maybe_record_large_cells large_data_handler: rename partition_above_threshold to above_threshold_result sstables: add LargeDataRecords metadata type (tag 13) sstables: add fmt::formatter for large_data_type keys: move key_to_str() to keys/keys.hh	2026-04-16 14:03:31 +03:00
Pavel Emelyanov	4d352c7cf5	sstables: Remove ignore_component_digest_mismatch from sstable_open_config The ignore_component_digest_mismatch flag is now initialized at sstable construction time from sstables_manager::config (which is populated from db::config at boot time). Remove the flag from sstable_open_config struct and all call sites that were setting it explicitly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:49:14 +03:00
Pavel Emelyanov	9107e055b3	sstables: Move ignore_component_digest_mismatch initialization to constructor Initialize the ignore_component_digest_mismatch flag from sstables_manager::config in the sstable constructor initializer list instead of in load(). This ensures the flag value is set at construction time when the manager config is available, rather than at load time. Mark the member const to reflect its immutability after construction. Fixes the bootstrap path which now correctly reads the flag from manager config initialized from db::config at boot time, instead of using the default value. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:49:00 +03:00
Pavel Emelyanov	8abfd9af00	sstables: Add ignore_component_digest_mismatch to sstables_manager config Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config during database initialization. This makes the flag available early in the boot process, before SSTables are loaded, enabling later commits to move the flag initialization from load-time to construction-time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 13:48:49 +03:00
Łukasz Paszkowski	8d34127684	sstables: clean up TemporaryHashes file in wipe() The TemporaryHashes.db.tmp file is created during SSTable writing to store intermediate bloom filter hashes and is deleted before the SSTable is sealed. Since it is not tracked in the TOC, it is also absent from _recognized_components and all_components(). When an SSTable write fails before sealing (e.g. streaming rejected due to critical disk utilization), wipe() is called to clean up the partial SSTable. However, wipe() only iterates over all_components(), so the TemporaryHashes file was left behind as an orphan. Previously, the only cleanup mechanism for this file was the startup-time directory scanner in sstable_directory, which would not help when the orphan needs to be cleaned up at runtime. Explicitly remove the TemporaryHashes file in wipe(), ignoring ENOENT for the common case where the file was already removed before sealing.	2026-04-16 08:38:34 +02:00
Łukasz Paszkowski	159675e975	sstables: add error injection point in write_components Add a `write_components_writer_created` error injection point in `sstable::write_components()` between writer creation and fragment consumption. This injection is needed by the out-of-space streaming test (added in the next patch) to reliably pause SSTable writing at the right moment: after the SSTable writer has been created and files exist on disk, but before mutation fragments are consumed. Pausing earlier (before writer creation) would not work because there are no files on disk yet, while pausing later (after consuming fragments) would be too late to reliably push the node into critical disk utilization.	2026-04-16 08:38:34 +02:00
Benny Halevy	1f7faeef57	sstables: populate LargeDataRecords from writer During compaction (SSTable writing), maintain bounded min-heaps (one per large_data_type) that collect the top-N above-threshold records. On stream end, drain all five heaps into a single LargeDataRecords array and write it into the SSTable's scylla metadata component. Five separate heaps are used: - partition_size, row_size, cell_size: ordered by value (size bytes) - rows_in_partition, elements_in_collection: ordered by elements_count A new config option 'compaction_large_data_records_per_sstable' (default 10) controls the maximum number of records kept per type.	2026-04-16 08:49:02 +03:00
Benny Halevy	8f4976f65d	large_data_handler: return above_threshold_result from maybe_record_large_cells Change maybe_record_large_cells to return above_threshold_result with separate booleans for cell size (.size) and collection elements (.elements) thresholds. This allows the writer to track above_threshold counts for cell_size and elements_in_collection independently.	2026-04-16 08:49:02 +03:00
Benny Halevy	c1b797f288	large_data_handler: rename partition_above_threshold to above_threshold_result Rename partition_above_threshold to above_threshold_result and its 'rows' field to 'elements', making it a generic struct that can be reused for other large data types (e.g., cells with collection elements). Use designated initializers for clarity.	2026-04-16 08:49:02 +03:00
Benny Halevy	d92cd42fe6	sstables: add LargeDataRecords metadata type (tag 13) Add a new scylla metadata component LargeDataRecords (tag 13) that stores per-SSTable top-N large data records. Each record carries: - large_data_type (partition_size, row_size, cell_size, etc.) - binary serialized partition key and clustering key - column name (for cell records) - value (size in bytes) - element count (rows or collection elements, type-dependent) - range tombstones and dead rows (partition records only) The struct uses disk_string<uint32_t> for key/name fields and is serialized via the existing describe_type framework into the SSTable Scylla metadata component. Add JSON support in scylla-sstable and format documentation.	2026-04-16 08:49:01 +03:00
Benny Halevy	85e2c6f2a7	sstables: add fmt::formatter for large_data_type Add a fmt::formatter specialization for sstables::large_data_type and use it in scylla-sstable.cc instead of the local to_string() overload, which is removed.	2026-04-16 08:42:54 +03:00
Dimitrios Symonidis	24a7b146fa	sstables/utils/s3: split config update into sync and async parts Config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases: Sync (runs in the observer, never suspends): - S3: atomically swap _cfg (lw_shared_ptr) and set a credentials refresh flag. - GCS: install a freshly constructed client; stash the old one for async cleanup. - storage_manager: update _object_storage_endpoints and fire the async cleanup via a gate-guarded background fiber. Async (gate-guarded background fiber): - S3: acquire _creds_sem, invalidate and rearm credentials only if the refresh flag is set. - GCS: drain and close stashed old clients.	2026-04-15 14:28:31 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Pavel Emelyanov	e0fa9ee332	Merge 'storage: implement sstable clone for object storage' from Ernest Zaslavsky This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction. The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed). The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path: - GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body. - S3 `copy_object` silently ignored the abort_source parameter. 1. gcp_client: fix copy_object request method and body — Fix two bugs in the GCS rewrite API call. 2. s3_client: pass through abort_source in copy_object — Stop ignoring the abort_source parameter. 3. object_storage: add copy_object to object_storage_client — New interface method with S3 and GCS implementations. 4. storage: add make_object_name overload with generation — Helper for building destination object names with a different generation. 5. storage: make delete_object const — Needed by the const clone method. 6. storage: implement object_storage_base::clone — The actual clone implementation plus a copy_object wrapper. 7. test/boost: enable sstable clone tests for S3 and GCS — Re-enable the previously skipped tests. A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both S3 and GCS. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045 Prerequisite: https://github.com/scylladb/scylladb/pull/28790 No need to backport since this code targets future feature Closes scylladb/scylladb#29166 * github.com:scylladb/scylladb: compaction_test: enable sstable clone tests for S3 and GCS storage: implement object_storage_base::clone storage: make delete_object const in object_storage_base storage: add make_object_name overload with generation sstables: add get_format() accessor to sstable object_storage: add copy_object to object_storage_client s3_client: pass through abort_source in copy_object gcp_client: fix copy_object request method and body	2026-04-08 09:35:10 +03:00
Ernest Zaslavsky	7cd9bbb010	storage: implement object_storage_base::clone Implement the clone method for object_storage_base, which creates a copy of an sstable with a new generation using server-side object copies. Also add a const copy_object convenience wrapper, similar to the existing put_object and delete_object wrappers. A dedicated test for the new object storage clone path will be added in the following commit. The preexisting local-filesystem clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable test.	2026-04-07 18:16:52 +03:00

4107 Commits