scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-23 00:02:37 +00:00

Author	SHA1	Message	Date
Botond Dénes	85f05fbe1b	Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk" This reverts commit `866c96f536`, reversing changes made to `367633270a`. This change caused all longevities to fail, with a crash in parsing scylla-metadata. The investigation is still ongoing, with no quick fix in sight yet. Fixes: #27496 Closes scylladb/scylladb#27518	2025-12-16 11:34:40 +02:00
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
Taras Veretilnyk	bc2e83bc1f	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.	2025-12-04 21:00:09 +01:00
Taras Veretilnyk	a191503ddf	sstables: Extract file writer closing logic into separate methods Refactor the consume_end_of_stream() method by extracting the inline file writer closing logic into dedicated methods: - close_index_writer() - close_partitions_writer() - close_rows_writer()	2025-12-02 13:07:41 +01:00
Michał Chojnowski	3cf51cb9e8	sstables: fix some typos in comments I added those typos recently, and spellcheckers complain. Closes scylladb/scylladb#26376	2025-10-07 13:20:06 +03:00
Michał Chojnowski	4ca215abbc	sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader Partitions.db uses a piece of the murmur hash of the partition key internally. The same hash is used to query the bloom filter. So to avoid computing the hash twice (which involves converting the key into a hashable linearized form) it would make sense to use the same `hashed_key` for both purposes. This is what we do in this patch. We extract the computation of the `hashed_key` from `make_pk_filter` up to its parent `sstable_set_impl::create_single_key_sstable_reader`, and we pass this hash down both to `make_pk_filter` and to the sstable reader. (And we add a pointer to the `hashed_key` as a parameter to all functions along the way, to propagate it). The number of parameters to `mx::make_reader` is getting uncomfortable. Maybe they should be packed into some structs.	2025-09-29 13:01:22 +02:00
Michał Chojnowski	cee4011e7a	sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db writer computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is also computed for bloom filter purposes just before that, in the owning sstable writer. So in this patch we let the caller pass that hash here instead.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f8e3d5e7c2	sstables/mx: make Index and Summary components optional In previous patches we (hopefully) modified all users of Index and Summary components so that they don't longer need those components to exist. (And can use Partitions and Rows components instead).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	b1984d6798	sstables: implement an alternative way to rebuild bloom filters for sstables without Index For efficiency, the cardinality of the bloom filter (i.e. the number of partition keys which will be written into the sstable) has to be known before elements are inserted into the filter. In some cases (e.g. memtables flush) this number is known exactly. But in others (e.g. repair) it can only be estimated, and the estimation might be very wrong, leading to an oversized filter. Because of that, some time ago we added a piece of logic (ran after the sstable is written, but before it's sealed) which looks at the actual number of written partitions, compares it to the initial estimate (on which the size of the bloom filter was based on), and if the difference is unacceptably large, it rewrites the bloom filter from partition keys contained in Index.db. But the idea to rebuild the bloom filters from index files isn't going to work with BTI indexes, because they don't store whole partition keys. If we want sstables which don't have Index.db files, we need some other way to deal with oversized filters. Partition keys can be recovered from Data.db, but that would often be way too expensive. This patch adds another way. We introduce a new component file, TemporaryHashes. This component, if written at all, contains the 16-byte murmur hash for every partition key, in order, and can be used in place of Index to reconstruct the bloom filter. (Our bloom filters are actually built from the set of murmur hashes of partition keys. The first step of inserting a partition key into a filter is hashing the key. Remembering the hashes is sufficient to build the filter later, without looking at partition keys again.) As of this patch, if the Index component is not being written, we don't allocate and populate a bloom filter during the Data.db write. Instead, we write the murmur hashes to TemporaryHashes, and only later, after the Data write finishes, we allocate the optimal-size, bloom filter, we read the hashes back from TemporaryHashes, and we populate the filter with them. That is suboptimal. Writing the hashes to disk (or worse, to S3) and reading them back is more expensive than building the bloom filter during the main Data pass. So ideally it should be avoided in cases where we know in advance that the partition key count estimate is good enough. (Which should be the case in flushes and compactions). But we defer that to a future patch. (Such a change would involve passing some flag to the sstable writer if the cardinality estimate is trustworthy, and not creating TemporaryHashes if the estimate is trustworthy).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e0fda9ae6f	sstables/mx/writer: populate BTI index files In the previous patch we added code responsible for creating and opening Partitions.db and Rows.db, but we left those files empty. In this patch, we populate the files using `trie::bti_row_index_writer` and `trie::bti_partition_index_writer`. Note: for the row index, we insert the same clustering blocks to both indexes. The logic for choosing the size of the blocks hasn't been changed in any way. Much of this patch has to do with propagating the current range tombstone down to all places which can start a new clustering block. The reason we need that is that, for each clustering block, BIG indexes store the range tombstone succeeding the block (i.e. the range tombstone in between the given block and its successor) BTI indexes store the range tombstone preceding the block, (i.e. the range tombstone in between the given block and its predecessor). So before the patch there's no code which looks at the current tombstone when starting the block, only when ending the block. This patch adds an extra copy for each `decorated_key`. This is mostly unavoidable -- the BTI partition writer just has to remember the key until its successor appears, to find the common prefix. (We could avoid the key copy if the BTI isn't used, though. We don't do that in this patch, we just let the copy happen).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cdcf34b3a0	sstables: create and open BTI index files, when enabled This patch adds code responsible for creation and opening of BTI index components (Rows.db, Partitions.db) when BTI index writing is enabled. (It is enabled if the cluster feature is enabled and the relevant config entry permits it). The files are empty for now, and are never read. We will populate and use them in following patches.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e04ee6d5f6	sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time` There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test) which tests the value of `_stats.capped_tombstone_deletion_time`. Before this patch, for "ms" sstables, `to_deletion_time` would have be called twice for each written partition tombstone, which would fail the test. Since `_pi_write_m.partition_tombstone` always ends up being converted from `tombstone` to `sstables::deletion_time` anyway, let's just make it a `sstables::deletion_time` to begin with. This will ensure that `to_deletion_time` will be able to be only called once per partition tombstone.	2025-09-29 13:01:20 +02:00
Avi Kivity	f6b6312cf4	Merge 'sstables/trie: prepare for integrating BTI indexes with sstable readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: introducing the new components, Partitions.db and Rows.db This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller. This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details. The changes are for the sake of new and unreleased code. No backporting should be done. Closes scylladb/scylladb#26075 * github.com:scylladb/scylladb: sstables/mx/reader: remove mx::make_reader_with_index_reader test/boost/bti_index_test: fix indentation sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file sstables/trie: support reader_permit and trace_state properly sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header sstables/trie/bti_index_reader: support BYPASS CACHE test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate sstables/trie: change the signature of bti_partition_index_writer::finish sstables/bti_index: improve signatures of special member functions in index writers streaming/stream_transfer_task: coroutinize `estimate_partitions()` types/comparable_bytes: add a missing implementation for date_type_impl sstables: remove an outdated FIXME storage_service: delete `get_splits()` sstables/trie: fix some comment typos in bti_index_reader.cc sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone	2025-09-18 12:10:27 +03:00
Ernest Zaslavsky	54aa552af7	treewide: Move type related files to a `type` directory As requested in #22110 , moved the files and fixed other includes and build system. Moved files: - duration.hh - duration.cc - concrete_types.hh Fixes: #22110 This is a cleanup, no need to backport Closes scylladb/scylladb#25088	2025-09-17 17:32:19 +03:00
Michał Chojnowski	b7afda5030	sstables/mx/reader: remove mx::make_reader_with_index_reader When `mx::make_reader` is used to construct an sstable reader, it constructs its own index reader internally. `mx::make_reader_with_index_reader` was originally added as a variant of `mx::make_reader` which can be used to inject a custom `index_reader` for testing that the mx Data reader tolerates inexact indexes. But now we want the ability to choose between BIG index readers and BTI index readers if both are present. And at this point, it seems to me that it makes sense to just construct the index reader in the caller and pass it via argument to `mx::make_reader` instead of putting the index selection inside it. So that's what we do in this patch. And we remove `mx::make_reader_with_index_reader` because it's no longer different from `mx::make_reader`.	2025-09-17 12:22:41 +02:00
Michał Chojnowski	416f5d64d4	sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone With `tomb` it's not obvious enough what tombstone this field is about.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	94088f0a41	sstables/writer: add an accessor for the current write position in Data.db It will be used by index tests to know the ground truth for where each partition and row are written, so that this can be checked against the index.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	30dad06c9a	sstables/mx: move clustering_info from writer.cc to types.hh We will use this type as the input to the BTI row index writer. Since it will be implemented in other translation units, the definition of the type has to be moved to a header.	2025-08-14 02:06:33 +02:00
Nadav Har'El	f6a3e6fbf0	sstables: don't depend on fmt 11.1 to build A recent commit `a0c29055e5` added some trace printouts which print an std::reference_wrapper<>. Apparently a formatter for this type was only added to fmt in version 11.1.0, and it doesn't exist on earlier versions, such as fmt 11.0.2 on Fedora 41. Let's avoid requiring shiny-new versions of fmt. The workaround is easy: just unwrap the reference_wrapper - print pr.get() instead of just pr, and Scylla returns to building correctly on Fedora 41. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25228	2025-07-29 11:32:06 +02:00
Avi Kivity	1930f3e67f	Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space. However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position. For example, if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first index entry after key `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.) Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges. Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome. Preparation for new functionality, no backporting needed. Closes scylladb/scylladb#25093 * github.com:scylladb/scylladb: sstables/index_reader: weaken some exactness guarantees in abstract_index_reader test/boost: add a test for inexact index lookups sstables/mx/reader: allow passing a custom index reader to the constructor sstables/index_reader: remove advance_to sstables/mx/reader: handle inexact lookups in `advance_context()` sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` sstables/index_reader: make the return value of `get_partition_key` optional sstables/mx/reader: handle "backward jumps" in forward_to sstables/mx/reader: filter out partitions outside the queried range sstables/mx/reader: update _pr after `fast_forward_to`	2025-07-27 19:39:36 +03:00
Michał Chojnowski	810eb93ff0	sstables/mx/reader: allow passing a custom index reader to the constructor For tests. Will be used for testing how the data reader reacts to various combinations of inexact index lookup results.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	03bf6347e2	sstables/mx/reader: handle inexact lookups in `advance_context()` `advance_context()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_context()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI. This makes `advance_to` unused, so we remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	11792850dd	sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` `advance_to_next_partition()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_to_next_partition()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	141895f9eb	sstables/index_reader: make the return value of `get_partition_key` optional BTI indexes only store encoded prefixes of partition keys, not the whole keys. They can't reliably implement `get_partition_key`. The index reader interface must be weakened and callers must be adapted.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	a0c29055e5	sstables/mx/reader: handle "backward jumps" in forward_to A bunch of code assumes that the Data.db stream can only go forward. But with BTI indexes, if we perform an advance_to, the index can point to a position which the data reader has already passed, since the index is inexact. The logic of the data reader ensures that it has stopped within the last partition range, or just immediately after it, after reading the next partition key and noticing that it doesn't belong to the range. But forward_to can only be used with increasing ranges. The start of the next range must be greater or equal to the end of the previous range. This means that the exact start of the next partition range must be no earlier than: 1. Before the partition key just read by the data reader, if the data reader is positioned immediately after a partition key. 2. The start of the first partition after the current data reader position, if the data reader isn't positioned immediately after a partition key. So, if the index returns a position smaller than the current data reader position, then: 1. If the reader is immediately after a partition key, we have to reuse this partition key (since we can't go back in the stream to read it again), and keep reading from the current position. 2. Otherwise we can safely walk the index to the first partition that lies no earlier than the current position.	2025-07-25 10:49:58 +02:00
Michał Chojnowski	218b2dffff	sstables/mx/reader: filter out partitions outside the queried range The current index format is exact: it always returns the position of the first partition in the queried partition range. But we are about the add an index format where that doesn't have to be the case. In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares the reader for that, by skipping the partitions which were read by the data reader but don't belong to the queried range. Note: as of this patch, only the "normal path" is ever used. We add tests exercising these code paths later. Also note that, as of this patch, actually stepping outside the queried range would cause the reader to end up in a state where the underlying parser is positioned right after partition key immediately following the queried range. If the reader was forwarded to that key in this state, it would trip an assert, because the parser can't handle backward jumps. We will add logic to handle this case in the next patch.	2025-07-25 10:49:57 +02:00
Michał Chojnowski	2b81fdf09b	sstables/mx/reader: update _pr after `fast_forward_to` In later patches, we will prepare the reader for inexact index implementations (ones which can return a Data file range that includes some partitions before or after the queried range). For that, we will need to filter out the partitions outside of the range, and for that we need to remember the range. This is the goal of this patch. Note that we are storing a reference to an argument of `fast_forward_to`. This is okay, because the contract of `mutation_reader` specifies that the caller must keep `pr` alive until the next `fast_forward_to` or until the reader is destroyed.	2025-07-25 10:49:57 +02:00
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Michał Chojnowski	4e4a4b6622	sstables: add sstable::make_index_reader() and use where appropriate If we add multiple index implementations, users of index readers won't easily know which concrete index reader type is the right one to construct. We also don't want pieces of code to depend on functionality specific to certain concrete types, if that's not necessary. So instead of constructing the readers by themselves, they can use a helper function, which will return an abstract (virtual) index reader. This patch adds such a function, as a method of `sstable`.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	1c4065e7dd	sstables/mx: in readers, use abstract_index_reader instead of index_reader This makes clear which methods of index_reader are available for use by sstable readers, and which aren't.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	efcf3f5d66	sstables: in validate(), use abstract_index_reader instead of index_reader where possible After we add a second index implementation, we will probably want to adjust validate() to work with either implementation. Some validations will be format-specific, but some will be common. For now, let's use abstract_index_reader for the validations which can be done through that interface, and let's have downcast-specific codepaths for the others. Note: we change a `get_data_file_position()` call to `data_file_positions().start`. The call happens at the beginning of a partition, and at this points these two expressions are supposed to be equivalent.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	1e7a292ef4	sstables/index_reader: extract a prefetch_lower_bound() method The sstable reader reaches directly for a `clustered_index_cursor`. But a BTI index reader won't be able to implement `clustered_index_cursor`, because a BTI index doesn't store full clustering keys, only some trie-encoded prefixes. So we want to weaken the dependency. Instead of reaching for `clustered_index_cursor`, we add a method which expresses our intent, and we let `index_reader` touch the cursor internally.	2025-07-16 00:13:20 +02:00
Ernest Zaslavsky	dff9a229a7	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use.	2025-07-15 10:10:23 +03:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions. Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Botond Dénes	592ca789e2	sstables/mx/writer: handler rows with empty keys Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Use the recently introduced corrupt_data_handler to handle rows with such corrupt keys. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.	2025-06-25 08:41:29 +03:00
Botond Dénes	bce89c0f5e	sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path So parse errors on corrupt SSTables don't result in crashes, instead just aborting the read in process. There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This patch tried to focus on those usages which are in the read path. Some places not only used on the read path may have been converted too, where the usage of said method is not clear.	2025-06-24 09:16:28 +03:00
Michał Chojnowski	b18ddcb92e	sstables: delegate compressor creation to the compressor factory Remove `compressor::create()`. This enforces that compressors are only created through the `sstable_compressor_factory`. Unlike the synchronous `compressor::create()`, the factory will be able to create dict-aware compressors.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	8e611536b0	sstables/compress: move ownership of `compressor` to `sstable::compression` SSTable readers and writers use `compressor` objects to compress and decompress chunks of SSTable data files. `compressor` objects are read-only, so only one of them is needed for each SSTable. Before this commit, each reader and writer has its own `compressor` object. This isn't necessary, but it's okay. But later in this series it will stop being okay, because the creation of a `compressor` will become an expensive cross-shard operation (because it might require sharing a compression dictionary from another shard). So we have to adjust the code so that there is only once `compressor` per sstable, not one per reader/writer. We stuff the ownership of this compressor into `sstable::compression`. To make the ownership clear, we remove `compression_ptr` shared pointers from readers and writers, and make them access the compressor via the `sstable::compression` instead.	2025-04-01 00:07:27 +02:00
Avi Kivity	73e4a3c581	sstables: store features early in write path sstable features indicate that an sstable has some extension, or that some bug was fixed. They allow us to know if we can rely on certain properties in a read sstables. Currently, sstable features are set early in the read path (when we read the scylla metadata file) and very late in the write path (when we write the scylla metadata file just before sealing the sstable). However, we happen to read features before we set them in the write path - when we resize the bloom filter for a newly written sstable we instantiate an index reader, and that depends on some features. As a result, we read a disengaged optional (for the scylla metadata component) as if it was engaged. This somehow worked so far, but fails with libstdc++ hash table implementation. Fix it by moving storage of the features to the sstable itself, and setting it early in the write path. Fixes #23484 Closes scylladb/scylladb#23485	2025-03-31 09:33:56 +03:00
Pavel Emelyanov	68c41f0459	sstables: Make file_writer keep component_name on board The class in question is a wrapper around output_stream that writes, flushes and closes the stream in async context. For logging it also keeps the component filename on board, and now it's good time to patch it and keep the component_filename instead. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	1ba91e28cb	sstables: Make get_filename() return component_name Similarly to previous patches -- mostly the result is used as log argument. The remaining users include - scylla sstable tool that dumps component names to json output - API endpoint that returns component names to user - tests these are all good to explicitly convert component_names to strings. There are few more places that expect strings instead of component name objects. For now they also use fmt::to_string() explicitly, partially it will be fixed later, mostly -- as future follow-ups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	80e0030613	sstables: Make sstable::index_filename() return component_name Most of the method callers use it as log parameter. There are few more places that push it to malformed_sstable_exception, which immediately converts it to string, so this patch makes the exception be constructed with the component_name either. And there's one more place that passes this string to file_writer constructor. For now, convert it to string explicitly, but next patches will fix that place to use pure component_name too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:01:23 +03:00
Pavel Emelyanov	dcc9167734	sstables: Rename filename($component) calls to ${component}_filename() There's a generic sstable::filename(component_type) method that returns a file name for the given component. For "popular" components, namely TOC, Data and Index there are dedicated sstable methods to get their names. Fix existing callers of the generic method to use the former. It's shorter, nicer and makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	eff61b167c	treewide: Reduce db/config.hh header fanout Drop it from files that obviously don't need it. Also kill some forward declarations while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22979	2025-02-25 15:16:40 +01:00
Kefu Chai	0237913337	sstables: Migrate from boost::adaptors::indexed to std::views::enumerate This change modernizes the codebase by: - Replacing Boost's indexed adaptor with C++20's std::views::enumerate - Removing unnecessary Boost header inclusion With this change, we can: - Reduce external dependencies - Leverage standard library features - Improve long-term code maintainability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22469	2025-01-26 20:51:14 +02:00
Kefu Chai	93be8f3a0c	db,sstables: migate boost::range::stable_partition to std library now that we are allowed to use C++23. we now have the luxury of using `std::ranges::stable_partition`. in this change, we: - replace `boost::range::stable_parition()` to `std::ranges::stable_parition()` - since `std::ranges::stable_parition()` returns a subrange instead of an iterator, change the names of variables which were previously used for holding the return value of `boost::range::stable_partition()` accordingly for better readability. - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21911	2024-12-19 14:56:07 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Botond Dénes	05246e123d	Merge 'sstables: Avoid computing column_values_fixed_lengths on each read' from Tomasz Grabiec Reads which need sstable index were computing column_values_fixed_lengths each time. This showed up in perf profile for a sstable-read heavy workload, and amounted to about 1-2% of time. Computing it involves type name parsing. Avoid by using cached per-sstable mapping. There is already sstable::_column_translation which can be used for this. It caches the mapping for the least-recently used schema. Since the cursor uses the mapping only for primary key columns, which are stable, any schema will do, so we can use the last _column_translation. We only need to make sure that it's always armed, so sstable loading is augmented with arming with sstable's schema. Also, fixes a potential use-after-free on schema in column_translation. Closes scylladb/scylladb#21347 * github.com:scylladb/scylladb: sstables: Fix potential use-after-free on column_translation::column_info::name sstables: Avoid computing column_values_fixed_lengths on each read	2024-12-12 12:22:32 +02:00
Kefu Chai	714d12014e	sstable/mx: use subrange.advance() when appropriate Replace manual subrange advancement with the more concise and readable `subrange.advance()` method. This change: - Eliminates unnecessary subrange instance creation - Improves code readability - Reduces potential for unnecessary object allocation - Leverages the built-in `advance()` method for cleaner iterator handling The modification simplifies the iteration logic while maintaining the same functional behavior. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21865	2024-12-12 10:04:12 +02:00

1 2 3 4 5 ...

284 Commits