scylladb

Author	SHA1	Message	Date
Raphael S. Carvalho	2c9b13d2d1	compaction: Check for key presence in memtable when calculating max purgeable timestamp It was observed that some use cases might append old data constantly to memtable, blocking GC of expired tombstones. That's because timestamp of memtable is unconditionally used for calculating max purgeable, even when the memtable doesn't contain the key of the tombstone we're trying to GC. The idea is to treat memtable as we treat L0 sstables, i.e. it will only prevent GC if it contains data that is possibly shadowed by the expired tombstone (after checking for key presence and timestamp). Memtable will usually have a small subset of keys in largest tier, so after this change, a large fraction of keys containing expired tombstones can be GCed when memtable contains old data. Fixes #17599. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17835	2024-03-18 13:37:44 +02:00
Kefu Chai	3f0fbdcd86	replica: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16810	2024-01-17 09:27:09 +02:00
Kefu Chai	0092700ad1	memtable: add formatter for replica::{memtable,memtable_entry} before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for replica::memtable and replica::memtable_entry, and remove their operator<<(). Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16793	2024-01-16 16:46:52 +02:00
Avi Kivity	02111d6754	memtable: consolidate _read_section, _allocating_section in a struct Those two members are passed from memtable_list to memtable. Since we wish to pass them from table, it becomes awkward to pass them as two separate variables as their contents are specific to memtable internals. Wrap them in a name that indicates their role (being table-wide shared data for memtables) and pass them as a unit.	2023-12-26 21:11:48 +02:00
Avi Kivity	7d5e22b43b	replica: memtable: don't forget memtable memory allocation statistics A memtable object contains two logalloc::allocating_section members that track memory allocation requirements during reads and writes. Because these are local to the memtable, each time we seal a memtable and create a new one, these statistics are forgotten. As a result we may have to re-learn the typical size of reads and writes, incurring a small performance penalty. The solution is to move the allocating_section object to the memtable_list container. The workload is the same across all memtables of the same table, so we don't lose discrimination here. The performance penalty may be increased later if we log changes to memory reserve thresholds including a backtrace, so this reduces the odds of incurring such a penalty. Closes scylladb/scylladb#15737	2023-10-18 17:43:33 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritors to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy -- it needs to use task queues list for IO classes names and shares, but to detect whether it should, it checks whether the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Michał Chojnowski	0273101890	partition_version: remove the unused "from" argument in partition_entry::upgrade() partition_entry now contains a reference to its schema, so it doesn't have to be supplied by the caller anymore.	2023-05-04 02:37:30 +02:00
Michał Chojnowski	94e4dc3d8d	partition_version: add a logalloc::region argument to partition_entry::upgrade() The argument is currently unused, but will be further propagated to add_version() in an upcoming patch.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	98dfe3355e	memtable: propagate the region to memtable_entry::upgrade_schema() Adds a logalloc::region argument to upgrade_schema(). It's currently unused, but will be further propagated to partition_entry::upgrade() in an upcoming patch.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	caaf0bd6bf	partition_version: remove _schema from partition_entry::operator<< operator<< accepts a schema& and a partition_entry&. But since the latter now contains a reference to its schema inside, the former is redundant. Remove it.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	f6e11c95e2	partition_version: remove the schema argument from partition_entry::read() partition_entry now contains a reference to its schema, so it no longer needs to be supplied by the caller.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	4e4ae43a84	memtable: remove _schema from memtable_entry After adding a _schema field to each partition version, the field in memtable_entry is redundant. It can be always recovered from the latest version. Remove it.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	a70c5704df	mutation_partition: change schema_ptr to schema& in mutation_partition constructor Cosmetic change. See the preceding commit for details.	2023-05-04 02:37:29 +02:00
Kefu Chai	c37f4e5252	treewide: use fmt::join() when appropriate now that fmtlib provides fmt::join(). see https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view there is no need to reinvent the wheel. so in this change, the homebrew join() is replaced with fmt::join(). as fmt::join() returns a join_view(), this could improve the performance under certain circumstances where the fully materialized string is not needed. please note, the goal of this change is to use fmt::join(), and this change does not intend to improve the performance of existing implementation based on "operator<<" unless the new implementation is much more complicated. we will address the unnecessarily materialized strings in a follow-up commit. some noteworthy things related to this change: * unlike the existing `join()`, `fmt::join()` returns a view. so we have to materialize the view if what we expect is a `sstring` * `fmt::format()` does not accept a view, so we cannot pass the return value of `fmt::join()` to `fmt::format()` * fmtlib does not format a typed pointer, i.e., it does not format, for instance, a `const std::string*`. but operator<<() always prints a typed pointer. so if we want to format a typed pointer, we either need to cast the pointer to `void*` or use `fmt::ptr()`. * fmtlib is not able to pick up the overload of `operator<<(std::ostream& os, const column_definition* cd)`, so we have to use a wrapper class of `maybe_column_definition` for printing a pointer to `column_definition`. since the overload is only used by the two overloads of `statement_restrictions::add_single_column_parition_key_restriction()`, the operator<< for `const column_definition*` is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-16 20:34:18 +08:00
Kefu Chai	0cb842797a	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-15 22:57:18 +02:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Avi Kivity	8d37370a71	Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj" This reverts commit `bcadd8229b`, reversing changes made to `cf528d7df9`. Since `4bd4aa2e88` ("Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"), memtable is self-compacting and the extra compaction step only reduces throughput. The unit test in memtable_test.cc is not reverted as proof that the revert does not cause a regression. Closes #11243	2022-08-09 11:23:29 +03:00
Benny Halevy	fcb3347c7a	memtable: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	2d1ba0d7d8	memtable: memtable_encoding_stats_collector: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Avi Kivity	2cb5f79e9d	logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc The region_group mechanism used an intrusive heap handle embedded in logalloc::region to allow region_group:s to track the largest region. But with region_group moved out of logalloc, the handle is out of place. Move it out, introducing a new intermediate class size_tracked_region to hold the heap handle. We might eventually merge the new class into memtable (which derives from it), but that requires a large rearrangement of unit tests, so defer that.	2022-07-26 11:12:10 +03:00
Avi Kivity	ee720fa23b	logalloc: relax lifetime rules around region_listener Currently, a region_listener is added during construction and removed during destruction. This was done to mimic the old region(region_group&) constructor, as region_listener replaces region_group. However, this makes moving the binomial heap handle outside logalloc difficult. The natural place for the handle is in a derived class of logalloc::region (e.g. memtable), but members of this derived class will be destroyed earlier than the logalloc::region here. We could play tricks with an earlier base class but it's better to just decouple region lifecycle from listener lifecycle. Do that by adding listen()/unlisten() methods. Some small awkwardness remains in that merge() implicitly unlistens (see comment in region::unlisten). Unit tests are adjusted.	2022-07-26 11:12:10 +03:00
Avi Kivity	cb1251199a	memtable: stop using logalloc::region::group() to test for flushed memtables Currently, the memtable reader uses logalloc::region::group() to test for whether a memtable has been flushed. If a memtable doesn't belong to a region group (from dirty_memory_manager), it is flushed. This is quite tortuous - logalloc::region::merge() makes the merged-from region identical to the merged-to region. The merged-to region, the cache, doesn't have a group, so the check works. Since we're making region groups part of dirty_memory_manager, the cache will no longer have this indirect way of communication with memtable. But instead we can use a direct callback it already has - on_detach_from_region_group(). Use that to set a flag, and examine it in the read path.	2022-07-26 11:07:25 +03:00
Tomasz Grabiec	53026f3ba6	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	a4e96960b8	mvcc: Apply mutations in memtable with preemption enabled Prerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be deferred, we need to involve snapshots to kick-off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-15 11:29:43 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Tomasz Grabiec	9135d1fd1f	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	0e3c4fc641	mvcc: Apply mutations in memtable with preemption enabled Prerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be deferred, we need to involve snapshots to kick-off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-06 19:23:37 +02:00
Botond Dénes	4f77e74bd4	partition_snapshot_reader: convert implementation to native v2 The underlying mutation representation is still v1, so the implementation still has to do conversion. This happens right above the lsa reader component.	2022-04-28 14:12:12 +03:00
Avi Kivity	a9812166cd	replica, partition_snapshot_reader, keys: replace boost::any with std::any Reduce #include load by standardizing on std::any. In keys.cc, we just drop the unneeded include. One instance of boost::any remains in config_file, due to a tie-in with other boost components. Closes #10441	2022-04-28 07:18:53 +03:00
Avi Kivity	e7fb71020b	Merge 'replica: Optimize empty_flat_reader out of the hot path' from Michał Chojnowski When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader. Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly. This patch series replaces hot path uses of empty_flat_reader with an empty optional. Performance effects: `perf_simple_query --smp 1` TPS: 138k -> 168k allocs/op: 80.2 -> 71.1 insns/op: 49.9k -> 45.1k `perf_simple_query --smp 1 --enable-cache=1 --flush` TPS: 125k -> 150k allocs/op: 79.2 -> 71.1 insns/op: 51.7k -> 47.2k For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread. Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache and memtables and needs to be merged) are unaffected. Closes #10204 * github.com:scylladb/scylla: replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader() row_cache: Add row_cache::make_reader_opt() replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader() memtable: Add memtable::make_flat_reader_opt() [avi: adjust #include for readers/ split]	2022-03-14 14:07:00 +02:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting optimizes both iterative compilation times, as touching rarely used readers doesn't recompile large chunks of codebase. Total compilation times are also improved, as the size of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes: real 29m14.051s, user 168m39.071s, sys 5m13.443s. Without changes: real 30m36.203s, user 175m43.354s, sys 5m26.376s. Closes #10194	2022-03-14 13:20:25 +02:00
Michał Chojnowski	f211ef9d71	replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader() The former is significantly cheaper when there is nothing to be read.	2022-03-14 12:02:49 +01:00
Michał Chojnowski	218f2b6e98	memtable: Add memtable::make_flat_reader_opt() When there is nothing to read, make_flat_reader() returns an empty (no-op) reader. But it turns out that constructing, combining and destroying that empty reader is quite costly. As an optimization, add an alternative version which returns an empty optional instead.	2022-03-14 12:02:49 +01:00
Mikołaj Sielużycki	f4c57cbe87	memtable: Convert partition_snapshot_flat_reader to v2. This is a facade change only, the make_partition_snapshot_flat_reader function calls upgrade_to_v2 internally. Closes #10152	2022-03-02 15:07:36 +02:00
Michael Livshin	34ed752885	memtable::make_flush_reader(): return flat_mutation_reader_v2 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michael Livshin	9bacce4359	memtable::make_flat_reader(): return flat_mutation_reader_v2 This is just a facade change. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Michał Radwański	2a3bd40c69	memtable: upgrade scanning_reader and flush_reader to v2 This change is a part of effort to migrate existing readers from old API to the new one. The corresponding make_flush_reader and make_flat_reader functions still return flat_mutation_reader.	2022-02-28 17:11:54 +02:00
Avi Kivity	cbba80914d	memtable: move to replica module and namespace Memtables are a replica-side entity, and so are moved to the replica module and namespace. Memtables are also used outside the replica, in two places: - in some virtual tables; this is also in some way inside the replica, (virtual readers are installed at the replica level, not the coordinator), so I don't consider it a layering violation - in many sstable unit tests, as a convenient way to create sstables with known input. This is a layering violation. We could make memtables their own module, but I think this is wrong. Memtables are deeply tied into replica memory management, and trying to make them a low-level primitive (at a lower level than sstables) will be difficult. Not least because memtables use sstables. Instead, we should have a memtable-like thing that doesn't support merging and doesn't have all other funky memtable stuff, and instead replace the uses of memtables in sstable tests with some kind of make_flat_mutation_reader_from_unsorted_mutations() that does the sorting that is the reason for the use of memtables in tests (and live with the layering violation meanwhile). Test: unit (dev) Closes #10120	2022-02-23 09:05:16 +02:00

38 Commits
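
Note on commit 2c9b13d2d1: the rule described in that message (a memtable only blocks tombstone GC when it contains the tombstone's key and holds data old enough to be shadowed) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not ScyllaDB's implementation: `data_source`, `max_purgeable_timestamp`, `can_purge` and `demo` are hypothetical names invented here, and the expiry (gc grace) check that real GC also performs is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a memtable or an sstable; not a ScyllaDB type.
struct data_source {
    std::set<std::string> keys;   // partition keys present in this source
    std::int64_t min_timestamp;   // smallest write timestamp held by this source

    bool contains(const std::string& key) const { return keys.count(key) > 0; }
};

// Upper bound on the timestamps of tombstones that may be purged for `key`.
// Only sources that actually contain `key` are allowed to lower the bound,
// which is the behaviour the commit message describes for memtables.
std::int64_t max_purgeable_timestamp(const std::vector<data_source>& sources,
                                     const std::string& key) {
    std::int64_t result = std::numeric_limits<std::int64_t>::max();
    for (const auto& s : sources) {
        if (s.contains(key)) {
            result = std::min(result, s.min_timestamp);
        }
    }
    return result;
}

// A tombstone is purgeable only if it is older than every piece of data
// it could still shadow. Expiry of the tombstone is assumed, not checked.
bool can_purge(std::int64_t tombstone_timestamp,
               const std::vector<data_source>& sources,
               const std::string& key) {
    return tombstone_timestamp < max_purgeable_timestamp(sources, key);
}

// Usage example: a memtable holding old data for "k1" only.
void demo() {
    data_source memtable{{"k1"}, 5};
    data_source l0_sstable{{"k2"}, 50};
    std::vector<data_source> sources{memtable, l0_sstable};

    assert(can_purge(10, sources, "k3"));   // no source has k3: nothing blocks GC
    assert(!can_purge(10, sources, "k1"));  // memtable has k1 with older data (5 < 10)
    assert(can_purge(10, sources, "k2"));   // k2 data (50) is newer than the tombstone
}
```

In this model, the behaviour the commit message describes as the old one corresponds to dropping the `contains` check for the memtable, so its timestamp of 5 would block GC of the tombstones for `k2` and `k3` as well.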