scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 12:47:02 +00:00

Author	SHA1	Message	Date
Botond Dénes	c7c5817808	Merge 'Improve timestamp heuristics for tombstone garbage collection' from Benny Halevy When purging a regular tombstone, consult the min_live_timestamp, if available. This is safe since we don't need to protect dead data from resurrection, as it is already dead. For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp, if available, otherwise fall back to the min_live_timestamp. If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk of resurrecting any deleted data. In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older than or equal to T. So to know if a whole sstable can cause problems for a shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older than or equal to T. And the same check applies similarly to the memtable. If both extended timestamp statistics are missing, fall back to the legacy (and inaccurate) min_timestamp. Fixes scylladb/scylladb#20423 Fixes scylladb/scylladb#20424 > [!NOTE] > no backport needed at this time > We may consider backport later on after giving it some soak time in master/enterprise > since we do see tombstone accumulation in the field under some materialized views workloads Closes scylladb/scylladb#20446 * github.com:scylladb/scylladb: cql-pytest: add test_compaction_tombstone_gc sstable_compaction_test: add mv_tombstone_purge_test sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection sstable_compaction_test: tombstone_purge_test: add testlog debugging sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp sstable, compaction: add debug logging for extended min timestamp stats compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats compaction: define max_purgeable_fn tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh sstables: scylla_metadata: add ext_timestamp_stats compaction_group, storage_group, table_state: add extended timestamp stats getters sstables, memtable: track live timestamps memtable_encoding_stats_collector: update row_marker: do nothing if missing	2024-09-13 08:56:51 +03:00
Kefu Chai	3e84d43f93	treewide: use seastar::format() or fmt::format() explicitly before this change, we rely on `using namespace seastar` to use `seastar::format()` without qualifying the `format()` with its namespace. this works fine until we changed the parameter type of the format string of `seastar::format()` from `const char*` to `fmt::format_string<...>`. this change practically invited `seastar::format()` to the club of `std::format()` and `fmt::format()`, where all members accept a templated parameter as its `fmt` parameter. and `seastar::format()` is not the best candidate anymore. although argument-dependent lookup (ADL for short) favors the function which is in the same namespace as its parameter, `using namespace` makes `seastar::format()` more competitive, so both `std::format()` and `seastar::format()` are considered as the candidates. that is what is happening in scylladb at quite a few call sites of `format()`, hence ADL is not able to tell which function is the winner in the name lookup: ``` /__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous 265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id()); | ^~~~~~ /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 4290 | format(format_string<_Args...> __fmt, _Args&&... __args) | ^ /__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 143 | format(fmt::format_string<A...> fmt, A&&... a) { | ^ ``` in this change, we change all `format()` to either `fmt::format()` or `seastar::format()` with the following rules: - if the caller expects an `sstring` or `std::string_view`, change to `seastar::format()` - if the caller expects an `std::string`, change to `fmt::format()`, because `sstring::operator std::basic_string` would incur a deep copy. we will need another change to enable scylladb to compile with the latest seastar. namely, to pass the format string as a templated parameter down to helper functions which format their parameters. to minimize the scope of this change, let's include that change when bumping up the seastar submodule. as that change will depend on the seastar change. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-11 23:21:40 +03:00
Benny Halevy	4de4af954f	sstables: scylla_metadata: add ext_timestamp_stats Store and retrieve the optional extended timestamp statistics (min_live_timestamp and min_live_row_marker_timestamp) in the scylla_metadata component. Note that there is no need for a cluster feature to store those attributes since the scylla_metadata on-disk format is extensible so that old sstables can be read by new versions, seeing the extra stats is missing, and new sstables can be read by old versions that ignore unknown scylla metadata section types. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:57 +03:00
Benny Halevy	14d86a3a12	sstables, memtable: track live timestamps When garbage collecting tombstones, we care only about shadowing of live data. However, currently we track min/max timestamp of both live and dead data, but there is no problem with purging tombstones that shadow dead data (expired or shadowed by other tombstones in the sstable/memtable). Also, for shadowable tombstones, we track live row marker timestamps separately since, if the live row marker timestamp is greater than a shadowable tombstone timestamp, then the row marker would shadow the shadowable tombstone thus exposing the cells in that row, even if their timestamp may be smaller than the shadowable tombstone's. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:49 +03:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar	7b58fa2534	sstables: use _origin in write path Now that the origin is available inside the sstable object, no need to pass it to the methods called in the write path. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:44:28 +05:30
Lakshmi Narayanan Sreethar	b762a09dcd	sstable::open_sstable: pass and store origin Pass origin when opening the sstable from the writer and store it in the sstable object. This will make the origin available for the entire write path. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:43:30 +05:30
Michał Chojnowski	1a8ee69a43	sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's There are two schemas associated with an sstable writer: the sstable's schema (i.e. the schema of the table at the time when the sstable object was created), and the writer's schema (equal to the schema of the reader which is feeding into the writer). It's easy to mix up the two and break something as a result. The writer's schema is needed to correctly interpret and serialize the data passing through the writer, and to populate the on-disk metadata about the on-disk schema. The sstable's schema is used to configure some parameters for a newly created sstable, such as bloom filter false positive ratio, or compression. The problem fixed by this patch is that the writer was wrongly creating the compressor objects based on its own schema, but using them based on the sstable's schema. This patch forces the writer to use the sstable's schema for both.	2024-07-11 12:53:54 +02:00
Michał Chojnowski	d10b38ba5b	sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's There are two schemas associated with an sstable writer: the sstable's schema (i.e. the schema of the table at the time when the sstable object was created), and the writer's schema (equal to the schema of the reader which is feeding into the writer). It's easy to mix up the two and break something as a result. The writer's schema is needed to correctly interpret and serialize the data passing through the writer, and to populate the on-disk metadata about the on-disk schema. The sstable's schema is used to configure some parameters for a newly created sstable, such as bloom filter false positive ratio, or compression. The problem fixed by this patch is that the writer was wrongly creating the filter based on its own schema, while the layer outside the writer was interpreting it as if it was created with the sstable's schema. This patch forces the writer to pick the filter's parameters based on the sstable's schema instead.	2024-07-11 12:53:54 +02:00
Lakshmi Narayanan Sreethar	c80df8504c	sstables::maybe_rebuild_filter_from_index: log sstable origin Log the sstable origin when its bloom filter is being rebuilt. The origin has to be passed to the method by the caller as it is not available in the sstable object when the filter is rebuilt. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#19601	2024-07-04 10:01:23 +03:00
Lakshmi Narayanan Sreethar	21e463b108	sstables/mx/writer: rebuild bloom filters with bad partition estimates The bloom filters are built with partition estimates, as the actual partition count might not be available in all cases. If the estimate was bad, the bloom filters might end up larger or smaller than their optimal sizes. Rebuild such bloom filters with the actual partition count before the filter is written to disk and the sstable is sealed. Fixes #19049 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-06-24 12:06:02 +05:30
Lakshmi Narayanan Sreethar	afc90657d6	sstables/mx/writer: add variable to track number of partitions consumed Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-06-24 12:06:02 +05:30
Ferenc Szili	b06af5b2b9	sstable: write dead_rows count to system.large_partitions	2024-05-02 11:49:10 +02:00
Ferenc Szili	63e724c974	sstable: added counter for dead rows	2024-05-02 11:49:10 +02:00
Ferenc Szili	98bec4e02a	sstable: large data handler needs to count range tombstones as rows When issuing warnings about partitions with the number of rows above a configured threshold, the large partitions handler does not take into consideration the number of range tombstone markers in the total rows count. This fix adds the number of range tombstone markers to the total number of rows and saves this total in system.large_partitions.rows (if it is above the threshold). It also adds a new column range_tombstones to the system.large_partitions table which only contains the number of range tombstone markers for the given partition. This PR fixes the first part of issue #13968. It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.	2024-04-22 15:24:18 +02:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back to the days when Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Avi Kivity	8ee75ae8f4	sstables: writer: don't require effective_replication_map for sharding metadata Currently, we pass an effective_replication_map_ptr to sstable_writer, so that we can get a stable dht::sharder for writing the sharding metadata. This is needed because with tablets, the sharder can change dynamically. However, this is both bad and unnecessary: - bad: holding on to an effective_replication_map_ptr is a barrier for topology operations, preventing tablet migrations (etc) while an sstable is being written - unnecessary: tablets don't require sharding metadata at all, since two tablets cannot overlap (unlike two sstables from different shards in the same node). So the first/last key is sufficient to determine the shard/tablet ownership. Given that, just pass the sharder for vnode sstables, and don't generate sharding metadata for tablet sstables.	2024-01-23 22:23:08 +02:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Kefu Chai	9c24be05c3	sstable/writer: log sstable name and pk when capping ldt when the local_deletion_time is too large and beyond the epoch time of INT32_MAX, we cap it to INT32_MAX - 1. this is a signal of bad configuration or a bug in scylla. so let's add more information in the logging message to help track back to the source of the problem. Fixes #15015 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-08-21 19:25:32 +08:00
Tomasz Grabiec	17d6163548	sstables: Generate sharding metadata using sharder from erm when writing We need to keep sharding metadata consistent with tablet mapping to shards in order for node restart to detect that those sstables belong to a single shard and that resharding is not necessary. Resharding of sstables based on tablet metadata is not implemented yet and will abort after this series. Keeping sharding metadata accurate for tablets is only necessary until compaction group integration is finished. After that, we can use the sstable token range to determine the owning tablet and thus the owning shard. Before that, we can't, because a single sstable may contain keys from different tablets, and the whole key range may overlap with keys which belong to other shards.	2023-06-21 00:58:24 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of the current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritors to the updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class | default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy -- it needs to use the task queues list for IO class names and shares, but to detect whether it should, it checks whether the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Pavel Emelyanov	ac1e56c9d9	sstable, storage: Virtualize data sink making for Data and Index Add the make_data_or_index_sink() virtual method and its implementation for filesystem_storage. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-10 16:43:01 +03:00
Pavel Emelyanov	1d4fcce5dd	sstable/writer: Shuffle writer::init_file_writers() The method needs to create two data sinks -- for Data and for Index files -- and then wrap them with more stuff (compression, checksums, streams, etc.). With the S3 backend, using file-output-stream won't work, because S3 storage cannot provide a writable file API (it has data_sink instead). This patch extracts file_data_sink creation so that it could be virtualized with storage API later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-10 16:43:01 +03:00
Kefu Chai	c37f4e5252	treewide: use fmt::join() when appropriate now that fmtlib provides fmt::join(). see https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view there is no need to reinvent the wheel. so in this change, the homebrew join() is replaced with fmt::join(). as fmt::join() returns a join_view(), this could improve the performance under certain circumstances where the fully materialized string is not needed. please note, the goal of this change is to use fmt::join(), and this change does not intend to improve the performance of existing implementation based on "operator<<" unless the new implementation is much more complicated. we will address the unnecessarily materialized strings in a follow-up commit. some noteworthy things related to this change: * unlike the existing `join()`, `fmt::join()` returns a view. so we have to materialize the view if what we expect is a `sstring` * `fmt::format()` does not accept a view, so we cannot pass the return value of `fmt::join()` to `fmt::format()` * fmtlib does not format a typed pointer, i.e., it does not format, for instance, a `const std::string*`. but operator<<() always prints a typed pointer. so if we want to format a typed pointer, we either need to cast the pointer to `void*` or use `fmt::ptr()`. * fmtlib is not able to pick up the overload of `operator<<(std::ostream& os, const column_definition* cd)`, so we have to use a wrapper class of `maybe_column_definition` for printing a pointer to `column_definition`. since the overload is only used by the two overloads of `statement_restrictions::add_single_column_parition_key_restriction()`, the operator<< for `const column_definition*` is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-16 20:34:18 +08:00
Pavel Emelyanov	0959739216	sstables: Remove always-false sstable_writer_config::leave_unsealed It was used in sstables streaming code up until `e5be3352` (database, streaming, messaging: drop streaming memtables) or nearby, then the whole feature was reworked. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12967	2023-02-23 12:50:06 +01:00
Kefu Chai	45f0449ccf	sstables: mx/writer: remove defaulted move ctor because its base class of `writer_impl` has a member variable `_validator`, which has its copy ctor deleted. let's just drop the defaulted move ctor, as compiler is not able to generate one for us. ``` /home/kefu/dev/scylladb/sstables/mx/writer.cc:805:5: error: explicitly defaulted move constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted] writer(writer&& o) = default; ^ /home/kefu/dev/scylladb/sstables/mx/writer.cc:528:16: note: move constructor of 'writer' is implicitly deleted because base class 'sstable_writer::writer_impl' has a deleted move constructor class writer : public sstable_writer::writer_impl { ^ /home/kefu/dev/scylladb/sstables/writer_impl.hh:29:48: note: copy constructor of 'writer_impl' is implicitly deleted because field '_validator' has a deleted copy constructor mutation_fragment_stream_validating_filter _validator; ^ /home/kefu/dev/scylladb/mutation/mutation_fragment_stream_validator.hh:188:5: note: 'mutation_fragment_stream_validating_filter' has been explicitly marked deleted here mutation_fragment_stream_validating_filter(const mutation_fragment_stream_validating_filter&) = delete; ^ ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12877	2023-02-15 23:06:10 +02:00
Kefu Chai	0cb842797a	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-15 22:57:18 +02:00
Avi Kivity	69a385fd9d	Introduce schema/ module Schema related files are moved there. This excludes schema files that also interact with mutations, because the mutation module depends on the schema. Those files will have to go into a separate module. Closes #12858	2023-02-15 11:01:50 +02:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Avi Kivity	7c7eb81a66	Merge 'Encapsulate filesystem access by sstable into filesystem_storage subsclass' from Pavel Emelyanov This is to define the API sstable needs from underlying storage. When implementing object-storage backend it will need to implement those. The API looks like future<> snapshot(const sstable& sst, sstring dir, absolute_path abs) const; future<> quarantine(const sstable& sst, delayed_commit_changes* delay); future<> move(const sstable& sst, sstring new_dir, generation_type generation, delayed_commit_changes* delay); void open(sstable& sst, const io_priority_class& pc); // runs in async context future<> wipe(const sstable& sst) noexcept; future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity); It doesn't have "list" or alike, because it's not a method of an individual sstable, but rather the one from sstables_manager. It will come as separate PR. Closes #12217 * github.com:scylladb/scylladb: sstable, storage: Mark dir/temp_dir private sstable: Remove get_dir() (well, almost) sstable: Add quarantine() method to storage sstable: Use absolute/relative path marking for snapshot() sstable: Remove temp_... stuff from sstable sstable: Move open_component() on storage sstable: Mark rename_new_sstable_component_file() const sstable: Print filename(type) on open-component error sstable: Reorganize new_sstable_component_file() sstable: Mark filename() private sstable: Introduce index_filename() tests: Disclosure private filename() calls sstable: Move wipe_storage() on storage sstable: Remove temp dir in wipe_storage() sstable: Move unlink parts into wipe_storage sstable: Remove get_temp_dir() sstable: Move write_toc() to storage sstable: Shuffle open_sstable() sstable: Move touch_temp_dir() to storage sstable: Move move() to storage sstable: Move create_links() to storage sstable: Move seal_sstable() to storage sstable: Tossing internals of seal_sstable() sstable: Move remove_temp_dir() to storage sstable: Move create_links_common() to storage sstable: Move check_create_links_replay() to storage sstable: Remove one of create_links() overloads sstable: Remove create_links_and_mark_for_removal() sstable: Indentation fix after prevuous patch sstable: Coroutinize create_links_common() sstable: Rename create_links_common()'s "dir" argument sstable: Make mark_for_removal bool_class sstable, table: Add sstable::snapshot() and use in table::take_snapshot sstable: Move _dir and _temp_dir on filesystem_storage sstable: Use sync_directory() method test, sstable: Use component_basename in test sstables: Move read_{digest|checksum} on sstable	2022-12-18 17:29:35 +02:00
Pavel Emelyanov	636d49f1c1	sstable: Shuffle open_sstable() When an sstable is prepared to be written on disk, .write_toc() is called on it, which creates a temporary toc file. Prior to this, the writer code calls generate_toc() to collect components on the sstable. This patch adds the .open_sstable() API call that does both. This prepares the write_toc() part to be moved to storage, because it's not just "write data into TOC file", it's the first step in a transaction implemented on top of rename()s. The tests need care -- there's a rewrite_toc_without_scylla_component() thing in utils that doesn't want the generate_toc() part to be called. It's not patched here and continues calling .write_toc(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
Tomasz Grabiec	23e4c83155	position_in_partition: Make after_key() work with non-full keys This fixes a long-standing bug related to handling of non-full clustering keys, issue #1446. after_key() was creating a position which is after all keys prefixed by a non-full key, rather than a position which is right after that key. This issue will be caught by cql_query_test::test_compact_storage in debug mode when mutation_partition_v2 merging starts inserting sentinels at position after_key() on preemption. It probably already causes problems for such keys.	2022-12-14 14:47:33 +01:00
Benny Halevy	7286f5d314	sstables: mx/writer: optimize large data stats members order Since `_partition_size_entry` and `_rows_in_partition_entry` are accessed at the same time when updated, and similarly `_cell_size_entry` and `_elements_in_collection_entry`, place the member pairs closely together to improve data cache locality. Follow the same order when preparing the `scylla_metadata::large_data_stats` map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:54:04 +03:00
Benny Halevy	8c8a0adb40	sstables: mx/writer: keep large data stats entry as members To save the map lookup on the hot write path, keep each large data stats entry as a member in the writer object and build a map for storing the disk_hash in the scylla metadata only when finalizing it in consume_end_of_stream. Fixes #11686 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-05 10:54:04 +03:00
Benny Halevy	6dadca2648	db/large_data_handler: maybe_record_large_cells: consider collection_elements Detect large_collections when the number of collection_elements is above the configured threshold. Next step would be to record the number of collection_elements in the system.large_cells table, when the respective cluster feature is enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:05 +03:00
Benny Halevy	7dead10742	sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells And update the sstable elements_in_collection stats entry. Next step would be to forward it to large_data_handler().maybe_record_large_cells(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:41:58 +03:00
Benny Halevy	54ab038825	sstables: mx/writer: add large_data_type::elements_in_collection Add a new large_data_stats type and entry for keeping the collection_elements_count_threshold and the maximum value of collection_elements. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:41:56 +03:00
Benny Halevy	ae7fd1c7b2	sstables: do not include db/large_data_handler.hh in sstables.hh Reduce dependencies by only forward-declaring class db::large_data_handler in sstables.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 12:42:58 +03:00
Raphael S. Carvalho	e2ccafbe38	compaction: Add support to split large partitions Adds support for splitting large partitions during compaction. Large partitions introduce many problems, such as memory overhead, and break the incremental compaction promise. We want to split large partitions across fixed-size fragments. We'll allow a partition to exceed the size limit by 10%, as we don't want to unnecessarily split partitions that just crossed the limit boundary. To avoid having to open a minimum of 2 fragments in a read, the partition tombstone will be replicated to every fragment storing the partition. The splitting isn't enabled by default, and can be used by strategies that are run aware like ICS. LCS still cannot support it as it's still using physical level metadata, not run id. An incremental reader for sstable runs will follow soon. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:23:16 -03:00
Benny Halevy	7747b8fa33	sstables: define run_identifier as a strong tagged_uuid type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11321	2022-08-18 19:03:10 +03:00
Avi Kivity	f5062f4b5a	Merge 'Use generation_type for SSTable ancestors' from Raphael "Raph" Carvalho To avoid a discrepancy about underlying generation type once something other than integer is allowed for the sstable generation. Also simplifies one generic writer interface for sealing sstable statistics. Closes #10703 * github.com:scylladb/scylla: sstables: Use generation_type for compaction ancestors sstables: Make compaction ancestors optional when sealing statistics	2022-06-01 19:55:08 +03:00
Raphael S. Carvalho	d36604703f	sstables: Make compaction ancestors optional when sealing statistics Compaction ancestors are only available in versions older than mx, therefore we can make them optional in seal_statistics(). The motivation is that the mx writer will no longer call sstable::compaction_ancestors(), whose return type will soon be changed to generation_type, so the returned value can be something other than an integer, e.g. uuid. We could kill compaction_ancestors in the seal_statistics interface, but given that most generic write functions still work for older versions, if there were still a writer for them, I decided to not do it now. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-05-31 15:26:03 -03:00
Mikołaj Sielużycki	bc18e97473	sstable_writer: Fix mutation order violation The change - adds a test which exposes a problem of a peculiar setup of tombstones that trigger a mutation fragment stream validation exception - fixes the problem Applying tombstones in the order: range_tombstone_change pos(ck1), after_all_prefixed, tombstone_timestamp=1 range_tombstone_change pos(ck2), before_all_prefixed, tombstone=NONE range_tombstone_change pos(NONE), after_all_prefixed, tombstone=NONE leads to swapping the order of mutations when written and read from disk via sstable writer. This is caused by conversion of range_tombstone_change (in memory representation) to range tombstone marker (on disk representation) and back. When this mutation stream is written to disk, the range tombstone marker's type is calculated based on the relationship between range_tombstone_changes. The RTC series as above produces markers (start, end, start). When the last marker is loaded from disk, its kind gets incorrectly loaded as before_all_prefixed instead of after_all_prefixed. This leads to incorrect order of mutations. The solution is to skip writing a new range_tombstone_change with empty tombstone if the last range_tombstone_change already has empty tombstone. This is redundant information and can be safely removed, while the logic of encoding RTCs as markers doesn't handle such redundancy well. Closes #10643	2022-05-31 13:39:48 +03:00
Benny Halevy	33bad72fd2	sstables: mx: add pi_auto_scale_events metric Counts the number of promoted index auto-scale events. A large number of those, relative to `partition_writes`, indicates that `column_index_size_in_kb` should be increased. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-24 13:32:39 +03:00
Benny Halevy	6677028212	sstables: mx/writer: auto-scale promoted index Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB). When the promoted index (serialized) size gets to this threshold, it's halved by merging each two adjacent blocks into one and doubling the desired_block_size. Fixes #4217 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-24 13:32:35 +03:00
Botond Dénes	105bf8888a	sstables: convert mx writer to v2 The sstables::sstable class has two methods for writing sstables: 1) sstable_writer get_writer(...); 2) future<> write_components(flat_mutation_reader, ...); (1) directly exposes the writer type, so we have to update all users of it (there are not that many) in this same patch. We defer updating users of (2) to follow-up commits.	2022-03-10 07:03:49 +02:00
Botond Dénes	11adb404c6	sstables/metadata_collector: use position_in_partition for min/max keys Instead of naked clustering keys. Working with the latter is dangerous because it cannot accurately represent the entire clustering domain: it cannot represent positions between (before/after) keys. For this reason the metadata collector had a separate update_min_max_components() overload for range tombstones because the positions of these cannot be represented by clustering keys alone. Moving to position_in_partition solves this problem and it is now enough to have a single overload with position_in_partition_view. This is also more future proof as it will work with range tombstone changes without any additional changes.	2022-03-10 07:03:49 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Avi Kivity	4d7a013e94	sstables: mx: writer: make large partition stats accounting branch-free It is bad form to introduce branches just for statistics, since branches can be expensive (even when perfectly predictable, they consume branch history resources). Switch to simple addition instead; this should not cause any cache misses since we already touch other statistics earlier. The inputs are already boolean, but cast them to boolean just so it is clear we're adding 0/1, not a count. Closes #9626	2021-11-15 11:28:48 +02:00
Michael Livshin	a7511cf600	system keyspace: record partitions with too many rows Add "rows" field to system.large_partitions. Add partitions to the table when they are too large or have too many rows. Fixes #9506 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9577	2021-11-14 14:25:18 +02:00
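
The timestamp heuristic described in merge commit c7c5817808 above can be sketched roughly as follows. This is an illustrative Python model only, not the ScyllaDB implementation (which is C++): the names `TimestampStats`, `effective_min_timestamp` and `can_purge` are hypothetical, and other purge conditions such as the GC grace period are deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TimestampStats:
    """Hypothetical per-sstable/memtable stats; extended fields may be missing."""
    min_timestamp: int                                   # legacy: live and dead data
    min_live_timestamp: Optional[int] = None             # live data only
    min_live_row_marker_timestamp: Optional[int] = None  # live row markers only


def effective_min_timestamp(stats: TimestampStats, shadowable: bool) -> int:
    """Pick the most precise statistic available, as the commit message describes."""
    if shadowable and stats.min_live_row_marker_timestamp is not None:
        return stats.min_live_row_marker_timestamp
    if stats.min_live_timestamp is not None:
        return stats.min_live_timestamp
    # Both extended statistics are missing: fall back to the legacy value.
    return stats.min_timestamp


def can_purge(tombstone_ts: int, sources: Iterable[TimestampStats],
              shadowable: bool = False) -> bool:
    """A tombstone at time T is only a problem for data at or before T."""
    return all(tombstone_ts < effective_min_timestamp(s, shadowable)
               for s in sources)
```

Under these assumptions, a source whose only data older than the tombstone is already dead no longer blocks purging, whereas a source that carries just the legacy `min_timestamp` still does.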


80 Commits