We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
Fixes#25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This reverts commit 866c96f536, reversing
changes made to 367633270a.
This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.
Fixes: #27496Closesscylladb/scylladb#27518
This reverts commit 8192f45e84.
The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.
The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
Tablet migration transfers sstable files without changing origin
host-id. As it should, becuase those sstables were not written on the
destination host, and should be ignored by commit log replay.
So it's a normal situation, and it's confusing to see this warning in
logs.
Fixes#26957Closesscylladb/scylladb#27433
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object.
To get the type, the method scans db::config option, but there's much simpler way to get one.
Code cleanup, no need to backport
Closesscylladb/scylladb#27381
* github.com:scylladb/scylladb:
sstables_loader: Provide endpoint type for get_sstables_from_object_store()
storage_manager: Introduce get_endpoint_type() method
storage_manager: Split get_endpoint_client()
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.
This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.
Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.
However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.
To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).
Fixes#27181
* New feature, no backport required
Closesscylladb/scylladb#27184
* github.com:scylladb/scylladb:
database: truncate_table_on_all_shards: set use_sstable_identifier to true
nodetool: snapshot: add --use-sstable-identifier option
api: storage_service: take_snapshot: add use_sstable_identifier option
test: database_test: add snapshot_use_sstable_identifier_works
test: database_test: snapshot_works: add validate_manifest
sstable: write_scylla_metadata: add random_sstable_identifier error injection
table: snapshot_on_all_shards: take snapshot_options
sstable: add get_format getter
sstable: snapshot: add use_sstable_identifier option
db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
db: snapshot_ctl: move skip_flush to struct snapshot_options
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:
lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;
Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.
This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.
Fixes#25799Closesscylladb/scylladb#27470
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by the snapshot code in te following patch
for manufacturing a basename using the sstable_id rather
than its generation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When set to true, use the sstable_identifier as the sstable name
in the snapshot rather than its generation.
sstable::snapshot now returns the generation it used
for the sstable in the snapshot, based on the `use_sstable_identifier`
option, to be used by the upper layer generating the manifest.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files
and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with
the statistics component.
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.
These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the diges.
New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.
Backport is not required, since it is a new feature
Fixes#21776Closesscylladb/scylladb#26702
* github.com:scylladb/scylladb:
file_stream_test: add sstable file streaming integrity verification test cases
streaming: prioritize sender-side errors in tablet_stream_files
sstables: enable integrity check for data file streaming
sstables: Add compressed raw streaming support
sstables: Allow to read digest and checksum from user provided file instance
sstables: add overload of data_stream() to accept custom file_input_stream_options
This patch enables integrity check in 'create_stream_sources()' by introducing a new
'sstable_data_stream_source_impl' class for handling the Data component of
SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead
of the raw input_stream.
These additional checks require reading the digest and CRC components from
disk, which may introduce some I/O overhead. For uncompressed SSTables,
this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the
cost comes from reading, calculation and verifying the digest.
Implement compressed_raw_file_data_source that streams compressed chunks
without decompression while verifying checksums and calculating digests.
Extends raw_stream enum to support compressed_chunks mode.
This data_source implementation will be used in the next commits
for file based streaming.
Add overloaded methods to read digest and checksum from user-provided file
handles:
- 'read_digest(file f)'
- 'read_checksum(file f)
This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime
Closesscylladb/scylladb#26617
Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well.
Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls.
Shuffling components dependencies, no need to backport
Closesscylladb/scylladb#27021
* github.com:scylladb/scylladb:
sstables_manager: Drop db::config from sstables_manager
tools/sstable: Make shard_of_with_tablets use db::config argument
tools/sstable: Add db::config& to all operations
tools/sstable: Get endpoints from storage manager
sstables_manager: Hold sstable IO extensions on it
sstables: Manager helper to grab file io extensions
sstables_manager: Move default format on config
sstables_manager: Move enable_sstable_data_integrity_check on config
sstables_manager: Move data_file_directories on config
sstables_manager: Move components_memory_reclaim_threshold on config
sstables_manager: Move column_index_auto_scale_threshold on config
sstables_manager: Move column_index_size on config
sstables_manager: Move sstable_summary_ratio on config
sstables_manager: Move enable_sstable_key_validation on config
sstables_manager: Move available_memory on config
code: Introduce sstables_manager::config
sstables: Patch get_local_directories() to work on vector of paths
code: Rename sstables_manager::config() into db_config()
This patch adds a metric for pre-compression size of sstable files.
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.
New functionality, no backport needed.
Closesscylladb/scylladb#26996
* github.com:scylladb/scylladb:
replica/table: add a metric for hypothetical total file size without compression
replica/table: keep track of total pre-compression file size
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many dataset
is a guaranteed OOM.
Fix that, and add some test that catches this.
Fixesscylladb/scylladb#27082Closesscylladb/scylladb#27083
Following 9b6ce030d0 ("sstables: remove quadratic (and possibly
exponential) compile time in parse()"), where we removed recursion
in reading, we do the same here for variadic write. This results
in a small reduction in compile time.
Note the problem isn't very bad here. This is tail-recursion, so likely
removed by the compiler during optimization, and we don't have additional
amplification due to future::then() double-compiling the ready-future
and unready-future paths. Still, better to avoid quadratic compile
times.
Closesscylladb/scylladb#27050
The tool may open sstables on S3. For that it gets configured endpoints
with the help of db::config obtained from sstables_manager.db_config().
However, storage endpoints are maintained by sstables storage manager,
and since tool has this instance, it's better to use storage manager to
get list of endpoints.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently manager holds a reference on db::config and when sstables IO
extensions are needed it grabs them from this config. Since db::config
is going to be removed from sstables manager, it should either keep
track of all config extensions, or only those that it needs. This patch
makes the latter choice and keeps reference to sstable_file_io_ext. on
manager. The reference is passed as constructor argument, not via
manager config, but it's a random choice, no specific reason why not
putting it on config itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently all the code that needs to iterate over sstables extensions
get config from manager, extensions from it and then iterate. Add a
helper that returns extensions directly. No real changes, just a helper.
Next patch will change the way the helper works.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's explicitly `me` type by default, but places that can write sstables
override it with db::config value: replica::database, tests and scylla
sstable tool.
Live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only the
replica::database and tests may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.
This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.
To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.
Backport is not required, it is an improvement
Refs #21776Closesscylladb/scylladb#26444
* github.com:scylladb/scylladb:
boost/repair_test: add repair reader integrity verification test cases
test/lib: allow to disable compression in random_mutation_generator
sstables: Skip checksum and digest reads for unlinked SSTables
table: enable integrity checks for streaming reader
table: Add integrity option to table::make_sstable_reader()
sstables: Add integrity option to create_single_key_sstable_reader
This patch fixes 2 issues at one go:
First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
Fixes#26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#26991
Set its default value to the one from db/config.cc. Only the
replica::database may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it OFF by default and update only those callers, that may have it
ON -- the replica::database, tests and scylla-sstable tool.
Also not live-updateable, so plain bool.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, this parameter is passed to sstables_manager as explicit
constructor argument.
Also, it's not live-updateable, so a plain size_t type for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is specific configuration for sstables_manager. All places that
construct sstables manager are updated to provide config to it. For now
the config is empty and exists alongside with db::config. Further
patches will populate the former config with data and the latter config
will be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it uses db::config. Next patches will eliminate db::config from this
code and the helper in question will need to get datadir names
explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The config() method name is going to return sstables_manager config, so
first need to set this name free.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
During sstable summary parsing, the entire header was read into a single
buffer upfront and then parsed to obtain the positions. If the header
was too large, it could trigger oversized allocation warnings.
This commit updates the parse method to read one position at a time from
the input stream instead of reading the entire header at once. Since
`random_access_reader` already maintains an internal buffer of 128 KB,
there is no need to pre read the entire header upfront.
Fixes#24428
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#26846
Add an _unlinked flag to track SSTable unlink state and check it in
read_digest() and read_checksum() methods to skip file reads for
unlinked SSTables, preventing potential file not found errors.
Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
Fixesscylladb/scylladb#26819Closesscylladb/scylladb#26839
parse() taking a list of elements is quadratic (during compile time) in
that it generates recursive calls to itself, each time with one fewer
parameter. The total size of the parameter lists in all these generated
functions is quadratic in the initial parameter list size.
It's also exponential if we ignore inlining limits, since each .then()
call expands to two branches - a ready future branch and a non-ready
future branch. If the compiler did not give up, we'd have 2^list_len
branches. For sure the compiler does not do so indefinitely, but the effort
getting there is wasted.
Simplify by using a fold expression over the comma operator. Instead
of passing the remaining parameter list in each step, we pass only
the parameter we are processing now, making processing linear, and not
generating unnecessary functions.
It would be better expressed using pack expansion statements, but these
are part of C++26.
The largest offender is probably stats_metadata, with 21 elements.
dev-mode sstables.o:
text data bss dec hex filename
1760059 1312 7673 1769044 1afe54 sstables.o.before
1745533 1312 7673 1754518 1ac596 sstables.o.after
We save about 15k of text with presumably a corresponding (small)
decrease in compile time.
Closesscylladb/scylladb#26735