Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:
lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;
Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.
This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.
Fixes#25799Closesscylladb/scylladb#27470
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files
and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with
the statistics component.
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.
These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the diges.
New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.
Backport is not required, since it is a new feature
Fixes#21776Closesscylladb/scylladb#26702
* github.com:scylladb/scylladb:
file_stream_test: add sstable file streaming integrity verification test cases
streaming: prioritize sender-side errors in tablet_stream_files
sstables: enable integrity check for data file streaming
sstables: Add compressed raw streaming support
sstables: Allow to read digest and checksum from user provided file instance
sstables: add overload of data_stream() to accept custom file_input_stream_options
This patch enables integrity check in 'create_stream_sources()' by introducing a new
'sstable_data_stream_source_impl' class for handling the Data component of
SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead
of the raw input_stream.
These additional checks require reading the digest and CRC components from
disk, which may introduce some I/O overhead. For uncompressed SSTables,
this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the
cost comes from reading, calculation and verifying the digest.
Implement compressed_raw_file_data_source that streams compressed chunks
without decompression while verifying checksums and calculating digests.
Extends raw_stream enum to support compressed_chunks mode.
This data_source implementation will be used in the next commits
for file based streaming.
Add overloaded methods to read digest and checksum from user-provided file
handles:
- 'read_digest(file f)'
- 'read_checksum(file f)
This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime
Closesscylladb/scylladb#26617
Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well.
Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls.
Shuffling components dependencies, no need to backport
Closesscylladb/scylladb#27021
* github.com:scylladb/scylladb:
sstables_manager: Drop db::config from sstables_manager
tools/sstable: Make shard_of_with_tablets use db::config argument
tools/sstable: Add db::config& to all operations
tools/sstable: Get endpoints from storage manager
sstables_manager: Hold sstable IO extensions on it
sstables: Manager helper to grab file io extensions
sstables_manager: Move default format on config
sstables_manager: Move enable_sstable_data_integrity_check on config
sstables_manager: Move data_file_directories on config
sstables_manager: Move components_memory_reclaim_threshold on config
sstables_manager: Move column_index_auto_scale_threshold on config
sstables_manager: Move column_index_size on config
sstables_manager: Move sstable_summary_ratio on config
sstables_manager: Move enable_sstable_key_validation on config
sstables_manager: Move available_memory on config
code: Introduce sstables_manager::config
sstables: Patch get_local_directories() to work on vector of paths
code: Rename sstables_manager::config() into db_config()
This patch adds a metric for pre-compression size of sstable files.
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.
New functionality, no backport needed.
Closesscylladb/scylladb#26996
* github.com:scylladb/scylladb:
replica/table: add a metric for hypothetical total file size without compression
replica/table: keep track of total pre-compression file size
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many dataset
is a guaranteed OOM.
Fix that, and add some test that catches this.
Fixesscylladb/scylladb#27082Closesscylladb/scylladb#27083
Following 9b6ce030d0 ("sstables: remove quadratic (and possibly
exponential) compile time in parse()"), where we removed recursion
in reading, we do the same here for variadic write. This results
in a small reduction in compile time.
Note the problem isn't very bad here. This is tail-recursion, so likely
removed by the compiler during optimization, and we don't have additional
amplification due to future::then() double-compiling the ready-future
and unready-future paths. Still, better to avoid quadratic compile
times.
Closesscylladb/scylladb#27050
The tool may open sstables on S3. For that it gets configured endpoints
with the help of db::config obtained from sstables_manager.db_config().
However, storage endpoints are maintained by sstables storage manager,
and since tool has this instance, it's better to use storage manager to
get list of endpoints.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently manager holds a reference on db::config and when sstables IO
extensions are needed it grabs them from this config. Since db::config
is going to be removed from sstables manager, it should either keep
track of all config extensions, or only those that it needs. This patch
makes the latter choice and keeps reference to sstable_file_io_ext. on
manager. The reference is passed as constructor argument, not via
manager config, but it's a random choice, no specific reason why not
putting it on config itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently all the code that needs to iterate over sstables extensions
get config from manager, extensions from it and then iterate. Add a
helper that returns extensions directly. No real changes, just a helper.
Next patch will change the way the helper works.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's explicitly `me` type by default, but places that can write sstables
override it with db::config value: replica::database, tests and scylla
sstable tool.
Live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only the
replica::database and tests may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.
This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.
To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.
Backport is not required, it is an improvement
Refs #21776Closesscylladb/scylladb#26444
* github.com:scylladb/scylladb:
boost/repair_test: add repair reader integrity verification test cases
test/lib: allow to disable compression in random_mutation_generator
sstables: Skip checksum and digest reads for unlinked SSTables
table: enable integrity checks for streaming reader
table: Add integrity option to table::make_sstable_reader()
sstables: Add integrity option to create_single_key_sstable_reader
This patch fixes 2 issues at one go:
First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
Fixes#26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#26991
Set its default value to the one from db/config.cc. Only the
replica::database may want to re-configure it.
This one is live-updateable, so use updateable_value<> type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it OFF by default and update only those callers, that may have it
ON -- the replica::database, tests and scylla-sstable tool.
Also not live-updateable, so plain bool.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, this parameter is passed to sstables_manager as explicit
constructor argument.
Also, it's not live-updateable, so a plain size_t type for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is specific configuration for sstables_manager. All places that
construct sstables manager are updated to provide config to it. For now
the config is empty and exists alongside with db::config. Further
patches will populate the former config with data and the latter config
will be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it uses db::config. Next patches will eliminate db::config from this
code and the helper in question will need to get datadir names
explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The config() method name is going to return sstables_manager config, so
first need to set this name free.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
During sstable summary parsing, the entire header was read into a single
buffer upfront and then parsed to obtain the positions. If the header
was too large, it could trigger oversized allocation warnings.
This commit updates the parse method to read one position at a time from
the input stream instead of reading the entire header at once. Since
`random_access_reader` already maintains an internal buffer of 128 KB,
there is no need to pre read the entire header upfront.
Fixes#24428
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#26846
Add an _unlinked flag to track SSTable unlink state and check it in
read_digest() and read_checksum() methods to skip file reads for
unlinked SSTables, preventing potential file not found errors.
Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
Fixesscylladb/scylladb#26819Closesscylladb/scylladb#26839
parse() taking a list of elements is quadratic (during compile time) in
that it generates recursive calls to itself, each time with one fewer
parameter. The total size of the parameter lists in all these generated
functions is quadratic in the initial parameter list size.
It's also exponential if we ignore inlining limits, since each .then()
call expands to two branches - a ready future branch and a non-ready
future branch. If the compiler did not give up, we'd have 2^list_len
branches. For sure the compiler does not do so indefinitely, but the effort
getting there is wasted.
Simplify by using a fold expression over the comma operator. Instead
of passing the remaining parameter list in each step, we pass only
the parameter we are processing now, making processing linear, and not
generating unnecessary functions.
It would be better expressed using pack expansion statements, but these
are part of C++26.
The largest offender is probably stats_metadata, with 21 elements.
dev-mode sstables.o:
text data bss dec hex filename
1760059 1312 7673 1769044 1afe54 sstables.o.before
1745533 1312 7673 1754518 1ac596 sstables.o.after
We save about 15k of text with presumably a corresponding (small)
decrease in compile time.
Closesscylladb/scylladb#26735
Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and them move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes few places in Scylla that still use one.
Adopting newer seastar API, no need to backport
Closesscylladb/scylladb#26747
* github.com:scylladb/scylladb:
commitlog: Remove unused work::r stream variable
ec2_snitch: Fix indentation after previous patch
ec2_snitch: Coroutinize the aws_api_call_once()
sstable: Construct output_stream for data instantly
test: Don't reuse on-stack input stream
This patch introduces a new overload of 'sstable::data_stream()' that allows
callers to provide their own 'file_input_stream_options'.
This change will be useful in the next commit to enable integrity checking
for file streaming.
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435
As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.
Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.
For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
This changes makes local output_stream variable be constructed in the
declaration statement with the help of ternary operator thus avoiding
both -- default-initialization and move-assignment depending on the
standalone condition checking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone.
Using newer seastar API, no need to backport
Closesscylladb/scylladb#26592
* github.com:scylladb/scylladb:
code: Fix indentation after previous patch
code: Switch to seastar API level 9
transport: Open-code invoke_with_counting into counting_data_sink::put
transport: Don't use scattered_message
utils: Implement memory_data_sink::put(net::packet)
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.
Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.
This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.
Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).
Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.
Fixes#25359Fixes#26453Closesscylladb/scylladb#26186
* github.com:scylladb/scylladb:
docs::dev::object_storage: Add some initial info on GS storage
docs/dev: Add mention of (nested) docker usage in testing.md
sstables::object_storage_client: Forward memory limit semaphore to GS instance
utils::gcp::object_storage: Add optional memory limits to up/download
sstables::object_storage_client: Add multi-upload support for GS
utils::gcp::storage: Add merge objects operation
test_backup/test_basic: Make tests multiplex both s3 and gs backends
test::cluster::conftest: Add support for multiple object storage backends
boost::gcs_storage_test: reindent
boost::gcs_storage_test: Convert to use fixture
tests::boost: Add GS object storage cases to mirror S3 ones
tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
sstables::object_storage_client: Add google storage implementation
test_services: Allow testing with GS object storage parameters
utils::gcp::gcp_credentials: Add option to create uninitialized credentials
utils::gcp::object_storage: Make create_download_source return seekable_data_source
utils::gcp::object_storage: Add defensive copies of string_view params
utils::gcp::object_storage: Add missing retry backoff increate
utils::gcp::object_storage: Add timestamp to object listing
utils::gcp::object_storage: Add paging support to list_objects
object_storage_client: Add object_name wrapper type
utils::gcp::object_storage: Add optional abort_source
utils::rest::client: Add abort_source support
sstables: Use object_storage_client for remote storage
sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
s3::upload_progress: Promote to general util type
storage_options: Abstract s3 to "object_storage" and add gs as option
sstables::file_io_extension: Change "creator" callback to just data_source
utils::io-wrappers: Add ranged data_source
utils::io-wrappers: Add file wrapper type for seekable_source
utils::seekable_source: Add a seekable IO source type
object_storage_endpoint_param: Add gs storage as option
config: break out object_storage_endpoint_param preparing for multi storage
In the new API the biggest change is to implement the only
data_sink_impl::put(span<temporary_buffer>) overload.
Encrypted file impl and sstables compress sink use fallback_put() helper
that generates a chain of continuations each holding a buffer.
The counting_data_sink in transport had mostly been patched to correct
implementation by the previous patch, the change here is to replace
vector argument with span one.
Most other sinks just re-implement their put(vector<temporary_buffer>)
overload by iterating over span and non-preemptively grabbing buffers
from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>