scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 18:10:39 +00:00

Author	SHA1	Message	Date
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch addes incremental_repair support in compaction. - The sstables are split into repaired and unrepaired set. - Repaired and unrepaired set compact sperately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if a sstable is repaired or not. - Different compactions tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Benny Halevy	3f44dba014	sstables: make_entry_descriptor: make regex non-greedy With greedy matching, an sstable path in a snapshot directory with a tag that resembles a name-<uuid> would match the dir regular expression as the longest match, while a non-greedy regular expression would correctly match the real keyspace and table as the shortest match. Also, add a regression unit test reproducing the issue and validating the fix. Fixes #25242 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25323	2025-08-07 15:35:11 +03:00
Michał Chojnowski	85964094f6	sstables/trie: implement BTI node traversal This commit implements routines for traversal of BTI nodes in their on-disk format. The `node_reader` concept is currently unused (i.e. not asserted by any template). It will only be used in the next PR, which will implement trie cursor routines parametrized `node_reader`. But I'm including it in this PR to make it clear which functions will be needed by the higher layer.	2025-08-05 00:56:50 +02:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Michał Chojnowski	4e4a4b6622	sstables: add sstable::make_index_reader() and use where appropriate If we add multiple index implementations, users of index readers won't easily know which concrete index reader type is the right one to construct. We also don't want pieces of code to depend on functionality specific to certain concrete types, if that's not necessary. So instead of constructing the readers by themselves, they can use a helper function, which will return an abstract (virtual) index reader. This patch adds such a function, as a method of `sstable`.	2025-07-17 10:32:57 +02:00
Ernest Zaslavsky	8d49bb8af2	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.	2025-07-15 10:10:23 +03:00
Ernest Zaslavsky	dff9a229a7	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use.	2025-07-15 10:10:23 +03:00
Ernest Zaslavsky	8ac2978239	sstables: coroutinize futurized readers Coroutinize futurized readers and sources to get ready for using `make_data_or_index_source` in `sstable`	2025-07-06 09:18:39 +03:00
Ernest Zaslavsky	0de61f56a2	sstables: add `make_data_or_index_source` to the `storage` Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`.	2025-07-06 09:18:39 +03:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions. Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Botond Dénes	ebd9420687	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers.	2025-06-25 08:41:26 +03:00
Botond Dénes	bce89c0f5e	sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path So parse errors on corrupt SSTables don't result in crashes, instead just aborting the read in process. There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This patch tried to focus on those usages which are in the read path. Some places not only used on the read path may have been converted too, where the usage of said method is not clear.	2025-06-24 09:16:28 +03:00
Botond Dénes	27e26ed93f	sstables/exceptions: introduce parse_assert() To replace SCYLLA_ASSERT on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database, the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced parse_assert() uses on_internal_error() under the hood, which prints a backtrace and optionally allows for aborting when on the error, to generate a coredump.	2025-06-24 09:15:29 +03:00
Raphael S. Carvalho	2d716f3ffe	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() get RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that a sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second from being filtered out. It's not perfect but extremely unlikely a new write lands and get flushed in the same millisecond we recorded truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, which frequency of creation is low enough for not having significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426	2025-06-08 15:59:15 +03:00
Nikos Dragazis	eaa2ce1bb5	sstables: Fix race when loading checksum component `read_checksum()` loads the checksum component from disk and stores a non-owning reference in the shareable components. To avoid loading the same component twice, the function has an early return statement. However, this does not guarantee atomicity - two fibers or threads may load the component and update the shareable components concurrently. This can lead to use-after-free situations when accessing the component through the shareable components, since the reference stored there is non-owning. This can happen when multiple compaction tasks run on the same SSTable (e.g., regular compaction and scrub-validate). Fix this by not updating the reference in shareable components, if a reference is already in place. Instead, create an owning reference to the existing component for the current fiber. This is less efficient than using a mutex, since the component may be loaded multiple times from disk before noticing the race, but no locks are used for any other SSTable component either. Also, this affects uncompressed SSTables, which are not that common. Fixes #23728. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#23872	2025-05-27 11:26:35 +03:00
Pavel Emelyanov	5bd3df507e	sstables: Lazily access statistics for trace-level logging There's a message in sstable::get_gc_before_for_fully_expire() method that is trace-level and one of its argument finds a value in sstable statisitics. Finding the value is not quite cheap (makes a lookup in std::unordered_map) and for mostly-off trace messages is just a waste of cycles. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23910	2025-05-12 11:22:31 +03:00
Pavel Emelyanov	0a9675de01	sstable: Use fmt::to_string(sstable::filename()) to get component file path The stream sink abort() method wants to remove component file by its path. For that the path is calculated from storage prefix and component basename, but there's a filename() method for it already. SStable filenames shouldn't be considered as on-disk paths (see #23194), but places that want it should be explicit and format the filename to string by hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24039	2025-05-07 22:25:58 +03:00
Pavel Emelyanov	c2ecc45db8	sstable: Remove validate argument from sstable::load_metadata() There are only two callers of the method and the one that wants validation (the sstable::load()) can do it on its own. This helps the other caller (schema loader) being simpler and shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24038	2025-05-07 20:57:37 +03:00
Botond Dénes	6172ff501f	readers: mv reversing_v2.hh reversing.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	f1bd2553ed	readers: mv forwardable_v2.hh forwardable.hh Completely mechanical change.	2025-04-16 04:33:50 -04:00
Michał Chojnowski	cee504f66f	sstables/compress: discard hidden compression options after the decompressor is created Dictionary contents are kept in the list of "compression options" in the header of `CompressionInfo.db`, and they are loaded from disk into memory when the `sstable::compression` object is populated. After the decompressor for the SSTable is created based on those dict contents, they are not needed in RAM anymore. And since they take up a sizeable amount of memory, we would like to free them. In this patch, we discard all "hidden compression options" (currently: only the dictionary contents) from the `sstable::compression` object right after the decompressor is created. (Those options are not supposed to be used for anything else anyway).	2025-04-01 00:07:30 +02:00
Michał Chojnowski	b18ddcb92e	sstables: delegate compressor creation to the compressor factory Remove `compressor::create()`. This enforces that compressors are only created through the `sstable_compressor_factory`. Unlike the synchronous `compressor::create()`, the factory will be able to create dict-aware compressors.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	11be7c0704	compress: remove `compression_parameters::get_compressor()` Following up on the previous commits, we avoid constructing compressors where not necessary, by checking things directly on `compression_parameters` instead.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	8e611536b0	sstables/compress: move ownership of `compressor` to `sstable::compression` SSTable readers and writers use `compressor` objects to compress and decompress chunks of SSTable data files. `compressor` objects are read-only, so only one of them is needed for each SSTable. Before this commit, each reader and writer has its own `compressor` object. This isn't necessary, but it's okay. But later in this series it will stop being okay, because the creation of a `compressor` will become an expensive cross-shard operation (because it might require sharing a compression dictionary from another shard). So we have to adjust the code so that there is only once `compressor` per sstable, not one per reader/writer. We stuff the ownership of this compressor into `sstable::compression`. To make the ownership clear, we remove `compression_ptr` shared pointers from readers and writers, and make them access the compressor via the `sstable::compression` instead.	2025-04-01 00:07:27 +02:00
Avi Kivity	73e4a3c581	sstables: store features early in write path sstable features indicate that an sstable has some extension, or that some bug was fixed. They allow us to know if we can rely on certain properties in a read sstables. Currently, sstable features are set early in the read path (when we read the scylla metadata file) and very late in the write path (when we write the scylla metadata file just before sealing the sstable). However, we happen to read features before we set them in the write path - when we resize the bloom filter for a newly written sstable we instantiate an index reader, and that depends on some features. As a result, we read a disengaged optional (for the scylla metadata component) as if it was engaged. This somehow worked so far, but fails with libstdc++ hash table implementation. Fix it by moving storage of the features to the sstable itself, and setting it early in the write path. Fixes #23484 Closes scylladb/scylladb#23485	2025-03-31 09:33:56 +03:00
Calle Wilund	e02be77af7	sstables::storage: Move wrapping sstable components to storage provider Fixes #23225 Fixes #23185 Moved wrapping component files/sinks to storage provider. Also ensures to wrap data_sinks as well as actual files. This ensures that we actually write encryption if active.	2025-03-20 14:54:24 +00:00
Calle Wilund	d46dcbb769	sstables::file_io_extension: Add a "wrap_sink" method. Similar to wrap file, should wrap a data_sink (used for sstable writers), in obvious write-only, simple stream mode. Default impl will detect if we wrap files for this component, and if so, generate a file wrapper for the input sink, wrap this, and the wrap it in a file_data_sink_impl. This is obviously not efficient, so extensions used in actual non-test code should implement the method.	2025-03-20 14:54:22 +00:00
Pavel Emelyanov	f06cc32812	sstables: Make filename() return component_name Similarly to toc_, index_ and data filenames, make the generic component name getter return back not string, but a wrapper object. Most of callers are log messages and exception generations. Other than that there are tests, filesystem storage driver and few more places in generic code who "know" that they work with real files, so make them use explicit fmt::to_string(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	68c41f0459	sstables: Make file_writer keep component_name on board The class in question is a wrapper around output_stream that writes, flushes and closes the stream in async context. For logging it also keeps the component filename on board, and now it's good time to patch it and keep the component_filename instead. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	1ba91e28cb	sstables: Make get_filename() return component_name Similarly to previous patches -- mostly the result is used as log argument. The remaining users include - scylla sstable tool that dumps component names to json output - API endpoint that returns component names to user - tests these are all good to explicitly convert component_names to strings. There are few more places that expect strings instead of component name objects. For now they also use fmt::to_string() explicitly, partially it will be fixed later, mostly -- as future follow-ups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	dbb9ee15c1	sstables: Introduce struct component_name The structure wraps const reference to sstable and component_name value (it's an enum of several elements). It also has a formatter so that it can be directly printed in logs (main usage) as well as converted to strings (auxiliary and discourage usage). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	aba400f5d9	sstables: Remove unused sstable::component_filenames() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	a8bc81eb3c	sstables: Use fmt::format instead of string concatenation There are some places that concatentate filenames with something else to get different filename (tool does it) or message for exception (read_toc() helper). This patch uses fmt::format() instead to facilitate future patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	dcc9167734	sstables: Rename filename($component) calls to ${component}_filename() There's a generic sstable::filename(component_type) method that returns a file name for the given component. For "popular" components, namely TOC, Data and Index there are dedicated sstable methods to get their names. Fix existing callers of the generic method to use the former. It's shorter, nicer and makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	e6898a8854	sstables: Rename local filename variable to component_name This is to be consistent with future changes and not to bloat them with extra renames Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:20 +03:00
Pavel Emelyanov	86b3e9b50b	code: Move checked-file-impl.hh to util/ fixes: #22100 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23123	2025-03-06 10:22:05 +02:00
Amnon Heiman	94ba8af788	sstables.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_sstables_cell_tombstone_writes scylla_sstables_range_tombstone_reads scylla_sstables_range_tombstone_writes scylla_sstables_row_tombstone_reads scylla_sstables_tombstone_writes	2025-03-03 16:58:38 +02:00
Lakshmi Narayanan Sreethar	77107ddaa3	sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()` Renamed the aboved mentioned method to `increment_total_reclaimable_memory()` as it doesn't directly reclaim memory anymore. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Avi Kivity	c71f383cf2	Merge 'sstables: use std::variant instead of boost::variant' from Botond Dénes Continue replacing boost types with std one where possible. Improvement, no backport needed Closes scylladb/scylladb#22290 * github.com:scylladb/scylladb: sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant scylla-gdb.py: std_variant: fix get() sstables: disk_types: remove unused disk_tagged_union	2025-01-29 15:29:15 +02:00
Botond Dénes	b70dccb638	sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant In the spirit of using standard-library types, instead of boost ones where possible. Although a disk type, it is serialized/deserialized with custom code, so the change shouldn't cause any changes in the disk representation.	2025-01-27 09:29:26 -05:00
Avi Kivity	a23a3110b5	utils: config_file: forward_declare boost::program_options classes Avoid pulling in boost dependencies when all we need is the class name. Closes scylladb/scylladb#22453	2025-01-27 10:45:43 +03:00
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Asias He	8208688178	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22034	2025-01-20 16:43:21 +02:00
Benny Halevy	3e22998dc1	sstables: parse(summary): reserve positions vector We know the number of positions in advance so reserve the chunked_vector capacity for that. Note: reservation replaces the existing reset of the positions member. This is safe since we parse the summary only once as sstable::read_summary() returns early if the summary component is already populated. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21767	2024-12-26 13:33:29 +02:00
Kefu Chai	6acc5294a4	treewide: migrate from boost::copy_range to std::ranges::to now that we are allowed to use C++23. we now have the luxury of using `std::ranges::to`. in this change, we: - replace `boost::copy_range` to `std::ranges::to` - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21880	2024-12-26 11:46:26 +02:00
Botond Dénes	d4129ddaa6	Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar When an sstable is unlinked, it remains in the _active list of the sstable manager. Its memory might be reclaimed and later reloaded, causing issues since the sstable is already unlinked. This patch updates the on_unlink method to reclaim memory from the sstable upon unlinking, remove it from memory tracking, and thereby prevent the issues described above. Added a testcase to verify the fix. Fixes #21887 This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions. Closes scylladb/scylladb#21895 * github.com:scylladb/scylladb: sstables_manager: reclaim memory from sstables on unlink sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable() sstables: introduce disable_component_memory_reload() sstables_manager: log sstable name when reclaiming components	2024-12-19 15:18:16 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Lakshmi Narayanan Sreethar	b7b4c5c661	sstables: introduce disable_component_memory_reload() Added a new method to disable reload of previously reclaimed components from the sstable. This will be used to disable reload of bloom filters after an sstable has been unlinked or deactivated. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-12-17 18:14:43 +05:30
Botond Dénes	05246e123d	Merge 'sstables: Avoid computing column_values_fixed_lengths on each read' from Tomasz Grabiec Reads which need sstable index were computing column_values_fixed_lengths each time. This showed up in perf profile for a sstable-read heavy workload, and amounted to about 1-2% of time. Computing it involves type name parsing. Avoid by using cached per-sstable mapping. There is already sstable::_column_translation which can be used for this. It caches the mapping for the least-recently used schema. Since the cursor uses the mapping only for primary key columns, which are stable, any schema will do, so we can use the last _column_translation. We only need to make sure that it's always armed, so sstable loading is augmented with arming with sstable's schema. Also, fixes a potential use-after-free on schema in column_translation. Closes scylladb/scylladb#21347 * github.com:scylladb/scylladb: sstables: Fix potential use-after-free on column_translation::column_info::name sstables: Avoid computing column_values_fixed_lengths on each read	2024-12-12 12:22:32 +02:00

1 2 3 4 5 ...

1331 Commits