`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records. On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.
Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count
A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:
Sync (runs in the observer, never suspends):
- S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
refresh flag.
- GCS: install a freshly constructed client; stash the old one for
async cleanup.
- storage_manager: update _object_storage_endpoints and fire the
async cleanup via a gate-guarded background fiber.
Async (gate-guarded background fiber):
- S3: acquire _creds_sem, invalidate and rearm credentials only if
the refresh flag is set.
- GCS: drain and close stashed old clients.
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
+ 0.0 means a read will never get rejected during admission
+ 1.0 means a read will immediatelly get rejected during admission
Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.
The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.
Fixes: SCYLLADB-196
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.
The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reverts commit 1bb897c7ca, reversing
changes made to 954f2cbd2f. It makes
incompatible changes to the object storage configuration format, breaking
tests [1]. It's likely that it doesn't break any production configuration,
but we can't be sure.
Fixes#27966Closesscylladb/scylladb#27969
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.
Closesscylladb/scylladb#27762
* github.com:scylladb/scylladb:
table: Move snapshot_file_set to table.cc
table: Rename and move snapshot_on_all_shards() method
table: Ditch jsondir variable
table, sstables: Pass snapshot name to sstable::snapshot()
table: Use snapshot_writer in write_manifest()
table: Use snapshot_writer in write_schema_as_cql()
table: Add snapshot_writer::sync()
table: Add snapshot_writer::init()
table: Introduce snapshot_writer
table: Move final sync and rename seal_snapshot()
table: Hide write_schema_as_cql()
table: Hide table::seal_snapshot()
table: Open-code finalize_snapshot()
table: Fix indentation after previuous patch
table: Use smp::invoke_on_all() to populate the vector with filenames
table: Don't touch dir once more on seal_snapshot()
table: Open-code table::take_snapshot() into caller lambda
table: Move parts of table::take_snapshot to sstables_manager
table: Introduce table::take_snapshot()
table: Store the result of smp::submit_to in local variable
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.
Refs: #7316Fixes: #26509
Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup
Closesscylladb/scylladb#27689
* github.com:scylladb/scylladb:
sstables/storage_manager: Fix configured endpoints observer
test/object_store: Add test to validate how endpoint config update works
On start the manager creates observer for object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
manager on all shards.
However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.
Next, when the config is updated, it first assigns new values to [0]
element of the hidden array, then calls broadcast_to_all_shards() helper
that copies the valaues from zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned on zero
index, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards, the non-zero shards will read old
values from db::config's inner array.
The fix is to instantiate observer on all shards and update only local
shard, whenever this update is triggered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.
This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To keep backward compatibility, support
- old configs -- where endpoint is just an address and port is separate.
When it happens, format the "new" endpoint name
- lookup by address-only. If it happens, scan all endpoints and see if
any one matches the provided address
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tool may open sstables on S3. For that it gets configured endpoints
with the help of db::config obtained from sstables_manager.db_config().
However, storage endpoints are maintained by sstables storage manager,
and since tool has this instance, it's better to use storage manager to
get list of endpoints.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently manager holds a reference on db::config and when sstables IO
extensions are needed it grabs them from this config. Since db::config
is going to be removed from sstables manager, it should either keep
track of all config extensions, or only those that it needs. This patch
makes the latter choice and keeps reference to sstable_file_io_ext. on
manager. The reference is passed as constructor argument, not via
manager config, but it's a random choice, no specific reason why not
putting it on config itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently all the code that needs to iterate over sstables extensions
get config from manager, extensions from it and then iterate. Add a
helper that returns extensions directly. No real changes, just a helper.
Next patch will change the way the helper works.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's explicitly `me` type by default, but places that can write sstables
override it with db::config value: replica::database, tests and scylla
sstable tool.
Live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only the
replica::database and tests may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only the
replica::database may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it OFF by default and update only those callers, that may have it
ON -- the replica::database, tests and scylla-sstable tool.
Also not live-updateable, so plain bool.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, this parameter is passed to sstables_manager as explicit
constructor argument.
Also, it's not live-updateable, so a plain size_t type for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is specific configuration for sstables_manager. All places that
construct sstables manager are updated to provide config to it. For now
the config is empty and exists alongside with db::config. Further
patches will populate the former config with data and the latter config
will be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it uses db::config. Next patches will eliminate db::config from this
code and the helper in question will need to get datadir names
explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The config() method name is going to return sstables_manager config, so
first need to set this name free.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Moves the config wrapper to own file (to reduce recompilation for modifying)
and refactors to handle extending this parameter to non-s3 endpoint configs.
Extend the `sstable_format` config enum with a "ms" value,
and, if it's enabled (in the config and in cluster features),
use it for new sstables on the node.
(Before this commit, writing `ms` sstables should only be possible
in unit tests, via internal APIs. After this commit, the format
can be enabled in the config and the database will write it during
normal operation).
As of this commit, the new format is not the default yet.
(But it will become the default in a later commit in the same series).
Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
The memory usage is tracked with the help of a semaphore, so just export
its "consumed" units.
One tricky place here is the need to skip metrics registration for
scylla-sstable tool. The thing is that the tools starts the storage
manager and sstables manager on start and then some of tool's operations
may want to start both managers again (via cql environment) causing
double metrics registration exception.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25769
It is returning constant sstables::uuid_identifiers::yes now,
so let the callers just use the constant (to be dropped
in a following patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The feature is supported by all live versions since
version 5.4 / 2024.1.
(Although up to 6da758d74c
it could be disabled using the config option)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.
1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.
So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.
Fixes#23771.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24426
Allow other submodules to subscribe for added/deleted
notifications. This will be used in a later to
patch to prioritize unlinked sstables for backup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move `object_storage.yaml` endpoints to `scylla.yaml`
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.
Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.
This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f
Refs https://github.com/scylladb/scylladb/issues/22428Closesscylladb/scylladb#22952
* github.com:scylladb/scylladb:
Remove db::config::object_storage_config
Move `object_storage.yaml` endpoints to `scylla.yaml`
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.
In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This patch updates the sstable manager to subscribe to any changes in
the above mentioned config value and immediately trigger the
reclaim/reload fiber when a change occurs. Also, adds a testcase to
verify the fix.
Fixes#21947
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Renamed the aboved mentioned method to `increment_total_reclaimable_memory()`
as it doesn't directly reclaim memory anymore.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>