scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 03:45:11 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	b5ede873f2	sstable_directory: Get components lister from manager For now this is almost a no-op because manager just calls sstables_directory code back to create the lister. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	3f9b8c855d	sstable_directory: Extract directory lister Currently the utils/lister.cc code is in use to list regular files in a directory. This patch wraps the lister into more abstract components lister class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	abd3602b10	sstable_directory: Remove sstable creation callback It's no longer used. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	3d559391df	sstable_directory: Call manager to make sstables Now the directory code has everything it needs to create sstable object and can stop using the external lambda. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	db657a8d1c	sstable_directory: Keep error handler generator Yet another continuation to previous patch -- IO error handlers generator is also needed to create sstables. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	4281f4af42	sstable_directory: Keep schema_ptr Continuation of one-before-previous patch. In order to create sstable without external lambda the directory code needs schema. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	8df1bcb907	sstable_directory: Use directory semaphore from manager After previous patch sstables_directory code no longer requires a semaphore argument, because it can get one from manager. This makes the directory API shorter and simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	4da941e159	sstable_directory: Keep reference on manager The sstables_directory accesses /var/lib/scylla/data in two ways -- lists files in it and opens sstables. The latter is abstracted with the help of lambdas passed around, but the former (listing) is done by using directory listers from utils. Listing sstables components with directory lister won't work for object storage, the directory code will need to call some abstraction layer instead. Opening sstables with the help of a lambda is a bit of overkill, having sstables manager at hand could make it much simpler. That said, this patch makes sstables_directory reference sstables_manager on start. This change will also simplify directory semaphore usage (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:19 +03:00
Pavel Emelyanov	5e13ce2619	sstables_manager: Keep directory semaphore reference Preparational patch. The semaphore will be used by sstables_directory in next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 12:03:18 +03:00
Pavel Emelyanov	be8512d7cc	sstables, code: Wrap directory semaphore with concurrency Currently this is a sharded<semaphore> started/stopped in main and referenced by database in order to be fed into sstables code. This semaphore always comes with the "concurrency" parameter that limits the parallel_for_each parallelism. This patch wraps both together into directory_semaphore class. This makes its usage simpler and will allow extending it in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-05 11:59:30 +03:00
Pavel Emelyanov	084522d9eb	sstable: Mark some methods private There are several class sstable methods that reveal internal directory path to caller. It's not object-storage-friendly. Fortunately, all the callers of those methods had been patched not to work with full paths, so these can be marked private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:15:02 +03:00
Pavel Emelyanov	a702affd4d	sstables: Reimplement batch directory sync after move There's a table::move_sstables_from_staging() method that gets a bunch of sstables and moves them from staging subdir into table's root datadir. Not to flush the root dir for every sstable move, it asks the sstable::move_to_new_dir() not to flush, but collects staging dir names and flushes them and the root dir at the end altogether. In order to make it more friendly to object-storage and to remove one more caller of sstable::get_dir() the delayed_commit_changes struct is introduced. It collects _all_ the affected dir names in unordered_set, then allows flushing them. By default the move_to_new_dir() doesn't receive this object and flushes the directories instantly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:08:47 +03:00
Pavel Emelyanov	339feb4205	sstables: Remove fsync_directory() helper The one effectively wraps existing seastar sync_directory() helper into two io_check-s. It's simpler just to call the latter directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-02 21:05:43 +03:00
Pavel Emelyanov	1d91914166	sstables: Drop set_generation() method The method became unused since `70e5252a` (table: no longer accept online loading of SSTable files in the main directory) and the whole concept of reshuffling sstables was dropped later by `7351db7c` (Reshape upload files and reshard+reshape at boot). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12165	2022-12-01 22:17:10 +02:00
Avi Kivity	5ae98ab3de	sstables: generation_type: forgo constexpr on hash of generation_type std::hash isn't constexpr, so gcc refuses to make hash of generation_type constexpr. It's pointless anyway since we never have a compile-time sstable generation.	2022-11-28 21:58:30 +02:00
Avi Kivity	7c66fdcad1	Merge 'Simplify sstable_directory configuration' from Pavel Emelyanov When started the sstable_directory is constructed with a bunch of booleans that control the way its process_sstable_dir method works. It's shorter and simpler to pass these booleans into method directly, all the more so there's another flag that's already passed like this. Closes #12005 * github.com:scylladb/scylladb: sstable_directory: Move all RAII booleans onto flags sstable_directory: Convert sort-sstables argument to flags struct sstable_directory: Drop default filter	2022-11-23 16:16:04 +02:00
Pavel Emelyanov	22133a3949	sstable_directory: Move all RAII booleans onto flags There's a bunch of booleans that control the behavior of sstable directory scanning. Currently they are described as verbose bool_class<>-es and are put into sstable_directory construction time. However, these are not used outside of .process_sstable_dir() method and moving them onto recently added flags struct makes the code much shorter (29 insertions(+), 121 deletions(-)) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-22 18:30:00 +03:00
Pavel Emelyanov	7ca5e143d7	sstable_directory: Convert sort-sstables argument to flags struct The sstable_directory::process_sstable_dir() accepts a boolean to control its behavior when collecting sstables. Turn this boolean into a structure of flags. The intention is to extend this flags set in the future (next patch). This boolean is true all the time, but one place sets it to true in a "verbose" manner, like this: bool sort_sstables_according_to_owner = false; process_sstable_dir(directory, sort_sstables_according_to_owner).get(); the local variable is not used anymore. Using designated initializers solves the verbosity in a nicer manner. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-22 18:19:23 +03:00
Pavel Emelyanov	7c7017d726	sstable_directory: Drop default filter It's used as default argument for .reshape() method, but callers specify it explicitly. At the same time the filter is simple enough and is only used in one place so that the caller can just use explicit lambda. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-22 18:19:23 +03:00
Pavel Emelyanov	2f9b7931af	sstables: Delete log file in replay_pending_delete_log() It's natural that the replayer cleans up after itself Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:16:22 +03:00
Pavel Emelyanov	bdc47b7717	sstables: Move deletion log manipulations to sstable_directory.cc The deletion log concept uses the fact that files are on a POSIX filesystem. Support for another storage type will have to reimplement this place, so keep the FS-specific code in _directory.cc file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:16:21 +03:00
Pavel Emelyanov	865c51c6cf	sstables: Open-code delete_sstables() call It's not used by any other code, but to be used it requires the caller to transform TOC file names by prepending sstable directory to them. Things get shorter and simpler if merging the helper code into the caller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	a61c96a627	sstables: Use fs::path in replay_pending_delete_log() It's called by a code that has fs::path at hand and internally uses helpers that need fs::path too, so no need to convert it back and forth. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	f5684bcaf0	sstables: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	85a73ca9c6	sstables: Coroutinize replay_pending_delete_log Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	6f3fd94162	sstables: Read pending delete log with one line helper There's one in seastar since recently Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:25 +03:00
Pavel Emelyanov	2dedf4d03a	sstables: Dont write pending log with file_writer It's a wrapper over output_stream with offset tracking and the tracking is not needed to generate a log file. As a bonus of switching back we get a stream.write(sstring) sugar. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-21 13:15:24 +03:00
Avi Kivity	994603171b	Merge 'Add validator to the mutation compactor' from Botond Dénes Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written. This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too. This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends depends entirely on how the consumer reacts to it, so no such bug is actually benign. Fixes: https://github.com/scylladb/scylladb/issues/11174 Closes #11532 * github.com:scylladb/scylladb: mutation_compactor: add validator mutation_fragment_stream_validator: add a 'none' validation level test/boost/mutation_query_test: test_partition_limit: sort input data querier: consume_page(): use partition_start as the sentinel value treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{} treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} position_in_partition: add for_partition_{start,end}()	2022-11-20 20:33:26 +02:00
Botond Dénes	437fcdeeda	Merge 'Make use of enum_set in directory lister' from Pavel Emelyanov The lister accepts sort of a filter -- what kind of entries to list, regular, directories or both. It currently uses unordered_set, but enum_set is shorter and better describes the intent. Closes #12017 * github.com:scylladb/scylladb: lister: Make lister::dir_entry_types an enum_set database: Avoid useless local variable	2022-11-18 12:15:26 +02:00
Pavel Emelyanov	bc62ca46d4	lister: Make lister::dir_entry_types an enum_set This type is currently an unordered_set, but only consists of at most two elements. Making it an enum_set renders it into a size_t variable and better describes the intention. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-11-17 19:01:45 +03:00
Avi Kivity	b8b78959fb	build: switch to packaged libdeflate rather than a submodule Now that our toolchain is based on Fedora 37, we can rely on its libdeflate rather than have to carry our own in a submodule. Frozen toolchain is regenerated. As a side effect clang is updated from 15.0.0 to 15.0.4. Closes #12000	2022-11-17 08:01:00 +02:00
Botond Dénes	0bcfc9d522	treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{} We just added a convenience static factory method for partition end, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Botond Dénes	f1a039fc2b	treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} We just added a convenience static factory method for partition start, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Tomasz Grabiec	4ff204c028	Merge 'cache: make all removals of cache items explicit' from Michał Chojnowski This series is a step towards non-LRU cache algorithms. Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor. However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`. This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction. Closes #11716 * github.com:scylladb/scylladb: utils: lru: assert that evictables are unlinked before destruction utils: lru: remove unlink_from_lru() cache: make all cache unlinks explicit	2022-10-17 12:47:02 +02:00
Michał Chojnowski	f340c9cca5	utils: lru: remove unlink_from_lru() unlink_from_lru() allows for unlinking elements from cache without notifying the cache. This messes up any potential cache bookkeeping. Improved that by replacing all uses of unlink_from_lru() with calls to lru::remove(), which does have access to cache's metadata.	2022-10-17 12:07:27 +02:00
Michał Chojnowski	d785364375	cache: make all cache unlinks explicit Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning that elements of the list unlink themselves from the list automatically on destruction. Some parts of the code rely on that, and don't unlink them manually. However, this precludes accurate bookkeeping about the cache. Elements only have access to themselves and their neighbours, not to any bookkeeping context. Therefore, a destructor cannot update the relevant metadata. In this patch, we fix this by adding explicit unlink calls to places where it would be done by a destructor. In a following patch, we will add an assert to the destructor to check that every element is unlinked before destruction.	2022-10-17 12:07:27 +02:00
Avi Kivity	20bad62562	Merge 'Detect and record large collections' from Benny Halevy This series adds support for detecting collections that have too many items and recording them in `system.large_cells`. A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000. Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly. A new column was added to system.large_cells: `collection_items`. Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold. Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred. Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it. The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled. In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items. 
Closes #11449 Closes #11674 * github.com:scylladb/scylladb: sstables: mx/writer: optimize large data stats members order sstables: mx/writer: keep large data stats entry as members db: large_data_handler: dynamically update config thresholds utils/updateable_value: add transforming_value_updater db/large_data_handler: cql_table_large_data_handler: record large_collections db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler db/large_data_handler: cql_table_large_data_handler: move ctor out of line docs: large-rows-large-cells-tables: fix typos db/system_keyspace: add collection_elements column to system.large_cells gms/feature_service: add large_collection_detection cluster feature test: sstable_3_x_test: add test_sstable_too_many_collection_elements test: lib: simple_schema: add support for optional collection column test: lib: simple_schema: build schema in ctor body test: lib: simple_schema: cql: define s1 as static only if built this way db/large_data_handler: maybe_record_large_cells: consider collection_elements db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells sstables: mx/writer: add large_data_type::elements_in_collection db/large_data_handler: get the collection_elements_count_threshold db/config: add compaction_collection_elements_count_warning_threshold test: sstable_3_x_test: add test_sstable_write_large_cell test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random	2022-10-06 18:28:21 +03:00
Benny Halevy	7286f5d314	sstables: mx/writer: optimize large data stats members order Since `_partition_size_entry` and `_rows_in_partition_entry` are accessed at the same time when updated, and similarly `_cell_size_entry` and `_elements_in_collection_entry`, place the member pairs closely together to improve data cache locality. Follow the same order when preparing the `scylla_metadata::large_data_stats` map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:54:04 +03:00
Benny Halevy	8c8a0adb40	sstables: mx/writer: keep large data stats entry as members To save the map lookup on the hot write path, keep each large data stats entry as a member in the writer object and build a map for storing the disk_hash in the scylla metadata only when finalizing it in consume_end_of_stream. Fixes #11686 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:54:04 +03:00
Botond Dénes	4c13328788	Merge 'Return all sstables in table::get_sstable_set()' from Raphael "Raph" Carvalho This fixes a regression introduced by `1e7a444`, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set. Fixes https://github.com/scylladb/scylladb/issues/11681. Closes #11682 * github.com:scylladb/scylladb: replica: Return all sstables in table::get_sstable_set() sstables: Fix cloning of compound_sstable_set	2022-10-05 06:55:50 +03:00
Pavel Emelyanov	2c1ef0d2b7	sstables.hh: Remove unused headers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11709	2022-10-04 23:37:07 +02:00
Raphael S. Carvalho	eddf32b94c	sstables: Fix cloning of compound_sstable_set The intention was that its clone() would actually clone the content of an existing set into a new one, but the current impl is actually moving the sets instead of copying them. So the original set becomes invalid. Luckily, this problem isn't triggered as we're not exposing the compound set in the table's interface, so the compound_sstable_set::clone() method isn't being called. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-04 10:43:25 -03:00
Benny Halevy	6dadca2648	db/large_data_handler: maybe_record_large_cells: consider collection_elements Detect large_collections when the number of collection_elements is above the configured threshold. Next step would be to record the number of collection_elements in the system.large_cells table, when the respective cluster feature is enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:05 +03:00
Benny Halevy	7dead10742	sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells And update the sstable elements_in_collection stats entry. Next step would be to forward it to large_data_handler().maybe_record_large_cells(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:41:58 +03:00
Benny Halevy	54ab038825	sstables: mx/writer: add large_data_type::elements_in_collection Add a new large_data_stats type and entry for keeping the collection_elements_count_threshold and the maximum value of collection_elements. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:41:56 +03:00
Tomasz Grabiec	9dae2b9c02	Merge 'mutation_fragment_stream_validator: various API improvements' from Botond Dénes The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had. Active tombstone validation is pushed down to the low level validator. The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore. Closes #11614 * github.com:scylladb/scylladb: mutation_fragment_stream_validator: make interface more robust mutation_fragment_stream_validator: add reset() to validating filter mutation_fragment_stream_validator: move active tombstone validation into low level validator	2022-10-03 16:23:46 +02:00
Benny Halevy	ae7fd1c7b2	sstables: do not include db/large_data_handler.hh in sstables.hh Reduce dependencies by only forward-declaring class db::large_data_handler in sstables.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 12:42:58 +03:00
Botond Dénes	a8cbf66573	mutation_fragment_stream_validator: move active tombstone validation into low level validator Currently the active range tombstone change is validated in the high level `mutation_fragment_stream_validating_stream`, meaning that users of the low-level `mutation_fragment_stream_validator` don't benefit from checking that tombstones are properly closed. This patch moves the validation down to the low-level validator (which is what the high-level one uses under the hood too), and requires all users to pass information about changes to the active tombstone for each fragment.	2022-09-26 10:17:27 +03:00
Michał Chojnowski	cdb3e71045	sstables: add a flag for disabling long-term index caching Long-term index caching in the global cache, as introduced in 4.6, is a major pessimization for workloads where accesses to the index are (spatially) sparse. We want to have a way to disable it for the affected workloads. There is already infrastructure in place for disabling it for BYPASS CACHE queries. One way of solving the issue is hijacking that infrastructure. This patch adds a global flag (and a corresponding CLI option) which controls index caching. Setting the flag to `false` causes all index reads to behave like they would in BYPASS CACHE queries. Consequences of this choice: - The per-SSTable partition_index_cache is unused. Every index_reader has its own, and they die together. Independent reads can no longer reuse the work of other reads which hit the same index pages. This is not crucial, since partition accesses have no (natural) spatial locality. Note that the original reason for partition_index_cache -- the ability to share reads for the lower and upper bound of the query -- is unaffected. - The per-SSTable cached_file is unused. Every index_reader has its own (uncached) input stream from the index file, and every bsearch_clustered_cursor has its own cached_file, which dies together with the cursor. Note that the cursor still can perform its binary search with caching. However, it won't be able to reuse the file pages read by index_reader. In particular, if the promoted index is small, and fits inside the same file page as its index_entry, that page will be re-read. It can also happen that index_reader will read the same index file page multiple times. When the summary is so dense that multiple index pages fit in one index file page, advancing the upper bound, which reads the next index page, will read the same index file page. Since summary:disk ratio is 1:2000, this is expected to happen for partitions with size greater than 2000 partition keys. 
Fixes #11202	2022-09-15 17:16:26 +03:00
Raphael S. Carvalho	e2ccafbe38	compaction: Add support to split large partitions Adds support for splitting large partitions during compaction. Large partitions introduce many problems, such as memory overhead and breaking the incremental compaction promise. We want to split large partitions across fixed-size fragments. We'll allow a partition to exceed the size limit by 10%, as we don't want to unnecessarily split partitions that just crossed the limit boundary. To avoid having to open a minimum of 2 fragments in a read, the partition tombstone will be replicated to every fragment storing the partition. The splitting isn't enabled by default, and can be used by strategies that are run aware like ICS. LCS still cannot support it as it's still using physical level metadata, not run id. An incremental reader for sstable runs will follow soon. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:23:16 -03:00


2874 Commits