scylladb

Author	SHA1	Message	Date
Botond Dénes	7cb8cab2ae	Merge 'Remove make_shared_schema() helper' from Pavel Emelyanov This function was obsoleted by schema_builder some time ago. Not to patch all its callers, that helper became wrapper around it. Remained users are all in tests, and patching the to use builder directory makes the code shorter in many cases. Closes scylladb/scylladb#20466 * github.com:scylladb/scylladb: schema: Ditch make_shared_schema() helper test: Tune up indentation in uncompressed_schema() test: Make tests use schema_builder instead of make_shared_schema	2024-09-13 12:25:10 +03:00
Benny Halevy	7d893a5ed9	compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats When purging regular tombstone consult the min_live_timestamp, if available. For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp, if available, otherwise fallback to the min_live_timestamp. If both are missing, fallback to the legacy (and inaccurate) min_timestamp. Fixes scylladb/scylladb#20423 Fixes scylladb/scylladb#20424 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:57 +03:00
Benny Halevy	57e9e9c369	compaction: define max_purgeable_fn Before we add a new, is_shadowable, parameter to it. And define global `can_always_purge` and `can_never_purge` functions, a-la `always_gc` and `never_gc`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:57 +03:00
Benny Halevy	b6fabd98c6	tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh And define `never_gc` globally, same as `always_gc` Before adding a new, is_shadowable parameter to it. Since it is used in the context of compaction it better fits compaction_garbage_collector header rather than tombstone.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:57 +03:00
Pavel Emelyanov	a1deba0779	test: Make tests use schema_builder instead of make_shared_schema Everything, but perf test is straightforward switch. The perf-test generated regular columns dynamically via vector, with builder the vector goes away. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-09-05 19:31:30 +03:00
Łukasz Paszkowski	ba2f037af5	mutation_partition: drop reverse parameter in compact_for_query The reverse parameter is no longer used with native reverse reads. The row ranges are provided in native reverse order together with a reversed schema, thus the reverse parameter remain false all the time and can be droped.	2024-08-13 10:07:12 +02:00
Botond Dénes	78206a3fad	test/boost: mutation_test: add test for cell compaction stats	2024-08-06 08:56:28 -04:00
Botond Dénes	259a59bd64	mutation/compact_and_expire_result: drop operator bool() Having an operator bool() on this struct is counter-intuitive, so this commit drops it and migrates any remaining users to bool is_live(). The purpose of this operator bool() was to help in incrementally replace the previous bool return type with compact_and_expire_result in the compact_and_expire() call stack. Now that this is done, it has served its purpose.	2024-08-06 08:56:28 -04:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Benny Halevy	9f05072527	token: make kind-based ctor private Users outside of the token module don't need to mess with the token::kind. They can only create key tokens. Never, minimum or maximum tokens, with a particular datya value. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-07-20 21:21:42 +03:00
Raphael S. Carvalho	ad5c5bca5f	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:53:35 -03:00
Avi Kivity	fdc1449392	treewide: rename flat_mutation_reader_v2 to mutation_reader flat_mutation_reader_v2 was introduced in a pair of commits in 2021: `e3309322c3` "Clone flat_mutation_reader related classes into v2 variants" `08b5773c12` "Adapt flat_mutation_reader_v2 to the new version of the API" as a replacement for flat_mutation_reader, using range_tombstone_change instead of range_tombstone to represent represent range tombstones. See those commits for more information. The transition was incremental; the last use of the original flat_mutation_reader was removed in 2022 in commit `026f8cc1e7` "db: Use mutation_partition_v2 in mvcc" In turn, flat_mutation_reader was introduced in 2017 in commit `748205ca75` "Introduce flat_mutation_reader" To transition from a mutation_reader that nested rows within a partition in a separate stream, to a flat reader that streamed partitions and rows in the same stream. Here, we reclaim the original name and rename the awkward flat_mutation_reader_v2 to mutation_reader. Note that mutation_fragment_v2 remains since we still use the original for compatibilty, sometimes. Some notes about the transition: - files were also renamed. In one case (flat_mutation_reader_test.cc), the rename target already existed, so we rename to mutation_reader_another_test.cc. - a namespace 'mutation_reader' with two definitions existed (in mutation_reader_fwd.hh). Its contents was folded into the mutation_reader class. As a result, a few #includes had to be adjusted. Closes scylladb/scylladb#19356	2024-06-21 07:12:06 +03:00
Kefu Chai	222dbf2ce4	test/boost: include test/lib/test_utils.hh this change was created in the same spirit of 505900f18f. because we are deprecating the operator<< for vector and unorderd_map in Seastar, some tests do not compile anymore if we disable these operators. so to be prepared for the change disabling them, let's include test/lib/test_utils.hh for accessing the printer dedicated for Boost.test. and also '#include <fmt/ranges.h>' when necessary, because, in order to format the ranges using {fmt}, we need to use fmt/ranges.h. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-05-26 12:32:43 +08:00
Avi Kivity	2fbd78c769	feature: grandfather DIGEST_FOR_NULL_VALUES The DIGEST_FOR_NULL_VALUES feature was added in `21a77612b3` (2020; 4.4) and can now be assumed to be always present. The hasher which it invoked is removed.	2024-05-18 00:24:00 +03:00
Benny Halevy	e1411f3911	mutation_partition: add apply_gently To be used for freezing mutations or making canonical mutations gently. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-02 18:45:24 +03:00
Benny Halevy	15e8ecb670	mutation_partition: apply_monotonically: do not support schema upgrade Currently, if the input mutation_partition requires schema upgrade, apply_monotonically always silently reverts to being non-preemptible, even if the caller passed is_preemptible::yes. To prevent that from happening, put the burden of upgrading the mutation_partition schem on the caller, which is today the apply() methods, which are synchronous anyhow. With that, we reduce the proliferation of the `apply_monotonically` overloads and keep only the low level one (which could potentially be private as well, as it's called only from within the mutation/ source files and from tests) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-02 18:42:41 +03:00
Petr Gusev	db1afa0aba	mutation: add split_mutation function The function splits the source mutation into multiple mutations so that their size does not exceed the max_size limit. The size of a mutation is calculated as the sum of the memory_usage() of its constituent mutation_fragments. The implementation is taken from view_updating_consumer. We use mutation_rebuilder_v2 to reconstruct mutations from a stream of mutation fragments and recreate the output mutation whenever we reach the limit. We'll need this function in the next commit.	2024-03-20 22:39:51 +04:00
Kefu Chai	1fe7a467e7	mutation: add fmt::formatter for mutation_fragment_v2::printer before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for mutation_fragment_v2::printer Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-26 17:47:05 +08:00
Avi Kivity	51df8b9173	interval: rename nonwrapping_interval to interval Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.	2024-02-21 19:43:17 +02:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Kefu Chai	97587a2ea4	test/boost: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17139	2024-02-06 13:22:16 +02:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Petr Gusev	beb29f094b	system_keyspace: drop load phases We want to switch system.scylla_local table to the schema commitlog, but load phases hamper here - schema commitlog is initialized after phase1, so a table which is using it should be moved to phase2, but system.scylla_local contains features, and we need them before schema commitlog initialization for SCHEMA_COMMITLOG feature. In this commit we are taking a different approach to loading system tables. First, we load them all in one pass in 'readonly' mode. In this mode, the table cannot be written to and has not yet been assigned a commit log. To achieve this we've added _readonly bool field to the table class, it's initialized to true in table's constructor. In addition, we changed the table constructor to always assign nullptr to commitlog, and we trigger an internal error if table.commitlog() property is accessed while the table is in readonly mode. Then, after triggering on_system_tables_loaded notifications on feature_service and sstable_format_selector, we call system_keyspace::mark_writable and eventually table::mark_ready_for_writes which selects the proper commitlog and marks the table as writable. In sstable_compaction_test we drop several mark_ready_for_writes calls since they are redundant, the table has already been made writable in env.make_table_for_tests call. The table::commitlog function either returns the current commitlog or causes an error if the table is readonly. This didn't work for virtual tables, since they never called mark_ready_for_writes. In this commit we add this call to initialize_virtual_tables.	2023-09-13 23:17:20 +04:00
Botond Dénes	a35f4f6985	test/mutation_test: test_compactor_validator_sanity_test Greatly expand this test to check that the compactor validates the input stream properly. The test is renamed (the _sanity_test suffix is removed) to reflect the expanded scope.	2023-07-20 08:48:50 -04:00
Kefu Chai	fa3129fa29	treewide: use unsigned variable to compare with unsigned some times we initialize a loop variable like auto i = 0; or int i = 0; but since the type of `0` is `int`, what we get is a variable of `int` type, but later we compare it with an unsigned number, if we compile the source code with `-Werror=sign-compare` option, the compiler would warn at seeing this. in general, this is a false alarm, as we are not likely to have a wrong comparison result here. but in order to prevent issues due to the integer promotion for comparison in other places. and to prepare for enabling `-Werror=sign-compare`. let's use unsigned to silence this warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-18 10:27:18 +08:00
Michał Chojnowski	c41f0ebd2a	test: mutation_test: unflake test_external_memory_usage The test has about 1/2500000 chance to fail due to a conflict of random values. And it recently did, just to spite us. Fight back. Fixes #14563 Closes #14576	2023-07-08 15:20:25 +03:00
Michał Chojnowski	5ad0846bff	view: fix range tombstone handling on flushes in view_updating_consumer View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes #14503	2023-07-04 20:33:21 +02:00
Benny Halevy	761d62cd82	compare_atomic_cell_for_merge: compare value last for live cells Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was changed in CASSANDRA-14592 for consistency with the preference for dead cells over live cells, as expiring cells will become tombstones at a future time and then they'd win over live cells with the same timestamp, hence they should win also before expiration. In addition, comparing the cell value before expiration can lead to unintuitive corner cases where rewriting a cell using the same timestamp but different TTL may cause scylla to return the cell with null value if it expired in the meanwhile. Also, when multiple columns are written using two upserts using the same write timestamp but with different expiration, selecting cells by their value may return a mixed result where each cell is selected individually from either upsert, by picking the cells with the largest values for each column, while using the expiration time to break tie will lead to a more consistent results where a set of cell from only one of the upserts will be selected. Fixes scylladb/scylladb#14182 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-06-20 10:10:39 +03:00
Benny Halevy	ec034b92c0	mutation_test: test_cell_ordering: improve debuggability Currently, it is hard to tell which of the many sub-cases fail in this unit test, in case any of them fails. This change uses logging in debug and trace level to help with that by reproducing the error with --logger-log-level testlog=trace (The cases are deterministic so reproducing should not be a problem) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-06-20 10:10:39 +03:00
Tomasz Grabiec	51e3b9321b	Merge ' mvcc: make schema upgrades gentle' from Michał Chojnowski After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded with one non-preemptible call. This is a one of the last vestiges of the times when partition were treated atomically, and it is a well known source of numerous large stalls. This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery. Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version. After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade. The series: 1. Does some code cleanup in the mutation_partition area. 2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry). 3. Adds upgrading variants of constructors and apply() for `row` and its wrappers. 4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains. 5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version. Fixes #2577 Closes #13761 * github.com:scylladb/scylladb: test: mvcc_test: add a test for gentle schema upgrades partition_version: make partition_entry::upgrade() gentle partition_version: handle multi-schema snapshots in merge_partition_versions mutation_partition_v2: handle schema upgrades in apply_monotonically() partition_version: remove the unused "from" argument in partition_entry::upgrade() row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades partition_version: handle multi-schema entries in partition_entry::squashed partition_snapshot_row_cursor: handle multi-schema snapshots partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots partition_version: add a logalloc::region argument to partition_entry::upgrade() memtable: propagate the region to memtable_entry::upgrade_schema() mutation_partition: add an upgrading variant of lazy_row::apply() mutation_partition: add an upgrading variant of rows_entry::rows_entry mutation_partition: switch an apply() call to apply_monotonically() mutation_partition: add an upgrading variant of rows_entry::apply_monotonically() mutation_fragment: add an upgrading variant of clustering_row::apply() mutation_partition: add an upgrading variant of row::row partition_version: remove _schema from partition_entry::operator<< partition_version: remove the schema argument from partition_entry::read() memtable: remove _schema from memtable_entry row_cache: remove _schema from cache_entry partition_version: remove the _schema field from partition_snapshot partition_version: add a _schema field to partition_version mutation_partition: change schema_ptr to schema& in mutation_partition::difference mutation_partition: change schema_ptr to schema& in mutation_partition constructor mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor mutation_partition: add upgrading variants of row::apply() partition_version: update the comment to apply_to_incomplete() mutation_partition_v2: clean up variants of apply() mutation_partition: remove apply_weak() mutation_partition_v2: remove a misleading comment in apply_monotonically() row_cache_test: add schema changes to test_concurrent_reads_and_eviction mutation_partition: fix mixed-schema apply()	2023-05-24 22:58:43 +02:00
Michał Chojnowski	bc6a07a16a	mutation_partition: change schema_ptr to schema& in mutation_partition::difference Cosmetic change. See the preceding commit for details.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	a70c5704df	mutation_partition: change schema_ptr to schema& in mutation_partition constructor Cosmetic change. See the preceding commit for details.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	781514acfe	mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor We don't have a convention for when to pass `schema_ptr` and and when to pass `const schema&` around. In general, IMHO the natural convention for such a situation is to pass the shared pointer if the callee might extend the lifetime of shared_ptr, and pass a reference otherwise. But we convert between them willy-nilly through shared_from_this(). While passing a reference to a function which actually expects a shared_ptr can make sense (e.g. due to the fact that smart pointers can't be passed in registers), the other way around is rather pointless. This patch takes one occurence of that and modifies the parameter to a reference. Since enable_shared_from_this makes shared pointer parameters and reference parameters interchangeable, this is a purely cosmetic change.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	49a02b08de	mutation_partition_v2: clean up variants of apply() Most variants of apply() and apply_monotonically() in mutation_partition_v2 are leftovers from mutation_partition, and are unused. Thus they only add confusion and maintenance burden. Since we will be modifying apply_monotonically() in upcoming patches, let's clean them up, lest the variants become stale. This patch removes all unused variants of apply() and apply_monotonically() and "manually inlines" the variants which aren't used often enough to carry their own weight. In the end, we are left with a single apply_monotonically() and two convenience apply() helpers. The single apply_monotonically() accepts two schema arguments. This facility is unimplemented and unused as of this patch - the two arguments are always the same - but it will be implemented and used in later parts of the series.	2023-05-04 02:37:29 +02:00
Botond Dénes	4365f004c1	test/boost/mutation_test: add sanity test for mutation compaction validator Checking that compacted fragments are forwarded to the validator intact.	2023-05-03 04:19:42 -04:00
Kamil Braun	30cc07b40d	Merge 'Introduce tablets' from Tomasz Grabiec This PR introduces an experimental feature called "tablets". Tablets are a way to distribute data in the cluster, which is an alternative to the current vnode-based replication. Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. With tablets, the aim is to start from a different side. Divide resources of replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets). This will allow us to balance the load in a more flexible manner, by moving individual tablets around. Also, unlike with vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind raft groups to tablets. Those goals are not yet achieved with this PR, but it lays the ground for this. Things achieved in this PR: - You can start a cluster and create a keyspace whose tables will use tablet-based replication. This is done by setting `initial_tablets` option: ``` CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3, 'initial_tablets': 8}; ``` All tables created in such a keyspace will be tablet-based. Tablet-based replication is a trait, not a separate replication strategy. Tablets don't change the spirit of replication strategy, it just alters the way in which data ownership is managed. In theory, we could use it for other strategies as well like EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy is augmented to support tablets. - You can create and drop tablet-based tables (no DDL language changes) - DML / DQL work with tablet-based tables Replicas for tablet-based tables are chosen from tablet metadata instead of token metadata Things which are not yet implemented: - handling of views, indexes, CDC created on tablet-based tables - sharding is done using the old method, it ignores the shard allocated in tablet metadata - node operations (topology changes, repair, rebuild) are not handling tablet-based tables - not integrated with compaction groups - tablet allocator piggy-backs on tokens to choose replicas. Eventually we want to allocate based on current load, not statically Closes #13387 * github.com:scylladb/scylladb: test: topology: Introduce test_tablets.py raft: Introduce 'raft_server_force_snapshot' error injection locator: network_topology_strategy: Support tablet replication service: Introduce tablet_allocator locator: Introduce tablet_aware_replication_strategy locator: Extract maybe_remove_node_being_replaced() dht: token_metadata: Introduce get_my_id() migration_manager: Send tablet metadata as part of schema pull storage_service: Load tablet metadata when reloading topology state storage_service: Load tablet metadata on boot and from group0 changes db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() migration_notifier: Introduce before_drop_keyspace() migration_manager: Make prepare_keyspace_drop_announcement() return a future<> test: perf: Introduce perf-tablets test: Introduce tablets_test test: lib: Do not override table id in create_table() utils, tablets: Introduce external_memory_usage() db: tablets: Add printers db: tablets: Add persistence layer dht: Use last_token_of_compaction_group() in split_token_range_msb() locator: Introduce tablet_metadata dht: Introduce first_token() dht: Introduce next_token() storage_proxy: Improve trace-level logging locator: token_metadata: Fix confusing comment on ring_range() dht, storage_proxy: Abstract token space splitting Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() db: Introduce get_non_local_vnode_based_strategy_keyspaces() service: storage_proxy: Avoid copying keyspace name in write handler locator: Introduce per-table replication strategy treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type locator: Introduce effective_replication_map locator: Rename effective_replication_map to vnode_effective_replication_map locator: effective_replication_map: Abstract get_pending_endpoints() db: Propagate feature_service to abstract_replication_strategy::validate_options() db: config: Introduce experimental "TABLETS" feature db: Log replication strategy for debugging purposes db: Log full exception on error in do_parse_schema_tables() db: keyspace: Remove non-const replication strategy getter config: Reformat	2023-04-27 09:40:18 +02:00
Kefu Chai	cc87e10f40	dht: print pk in decorated_key with "pk" prefix this change ensures that `dk._key` is formatted with the "pk" prefix. as in `3738fcb`, the `operator<<` for partition_key was removed. so the compiler has to find an alternative when trying to fulfill the needs when this operator<< is called. fortunately, from the compiler's perspective, `partition_key` has an `operator managed_bytes_view`, and this operator does not have the explicit specifier, and, `managed_bytes_view` does support `operator<<`. so this ends up with a change in the format of `decorated_key` when it is printed using `operator<<`. the code compiles. but unfortunately, the behavior is changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the partition_key is supposed to be printed like "pk{010203}" instead of "010203". the latter is how `managed_bytes_view` is formatted. a test is added accordingly to avoid future changes which break the dtest. Fixes scylladb#13628 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13653	2023-04-25 09:53:47 +02:00
Tomasz Grabiec	9b17ad3771	locator: Introduce per-table replication strategy Will be used by tablet-based replication strategies, for which effective replication map is different per table. Also, this patch adapts existing users of effective replication map to use the per-table effective replication map. For simplicity, every table has an effective replication map, even if the erm is per keyspace. This way the client code can be uniform and doesn't have to check whether replication strategy is per table. Not all users of per-keyspace get_effective_replication_map() are adapted yet to work per-table. Those algorithms will throw an exception when invoked on a keyspace which uses per-table replication strategy.	2023-04-24 10:49:36 +02:00
Botond Dénes	bae62f899d	mutation/mutation_compactor: consume_partition_end(): reset _stop The purpose of `_stop` is to remember whether the consumption of the last partition was interrupted or it was consumed fully. In the former case, the compactor allows retreiving the compaction state for the given partition, so that its compaction can be resumed at a later point in time. Currently, `_stop` is set to `stop_iteration::yes` whenever the return value of any of the `consume()` methods is also `stop_iteration::yes`. Meaning, if the consuming of the partition is interrupted, this is remembered in `_stop`. However, a partition whose consumption was interrupted is not always continued later. Sometimes consumption of a partitions is interrputed because the partition is not interesting and the downstream consumer wants to stop it. In these cases the compactor should not return an engagned optional from `detach_state()`, because there is not state to detach, the state should be thrown away. This was incorrectly handled so far and is fixed in this patch, but overwriting `_stop` in `consume_partition_end()` with whatever the downstream consumer returns. Meaning if they want to skip the partition, then `_stop` is reset to `stop_partition::no` and `detach_state()` will return a disengaged optional as it should in this case. Fixes: #12629 Closes #13365	2023-03-29 17:48:45 +03:00
Pavel Emelyanov	e882269d93	table: Keep storage options lw-shared-ptr Tables need to know which storage their sstables need to be located at, so class table needs to have itw reference of the storage options. The thing can be inherited from the keyspace metadata. Tests sometimes create table without keyspace at hand. For those use default-initialized storage options (which is local storage). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-16 17:30:45 +03:00
Kefu Chai	3ae11de204	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-28 21:56:53 +08:00
Botond Dénes	3c30531202	Merge 'test: mutation_test: Fix sporadic failure due to continuity mismatch' from Tomasz Grabiec In test_v2_apply_monotonically_is_monotonic_on_alloc_failures we generate mutations with non-full continuity, so we should pass is_evictable::yes to apply_monotonically(). Otherwise, it will assume fully-continuous versions and not try to maintain continuity by inserting sentinels. This manifested in sporadic failures on continuity check. Fixes #12882 Closes #12921 * github.com:scylladb/scylladb: test: mutation_test: Fix sporadic failure due to continuity mismatch test: mutation_test: Fix copy-paste mistake in trace-level logging	2023-02-20 12:46:15 +01:00
Tomasz Grabiec	2ae8f74cec	test: mutation_test: Fix sporadic failure due to continuity mismatch In test_v2_apply_monotonically_is_monotonic_on_alloc_failures we generate mutations with non-full continuity, so we should pass is_evictable::yes to apply_monotonically(). Otherwise, it will assume fully-continuous versions and not try to maintain continuity by inserting sentinels. This manifested in sporadic failures on continuity check. Fixes #12882	2023-02-17 14:43:32 +01:00
Tomasz Grabiec	22063713d7	test: mutation_test: Fix copy-paste mistake in trace-level logging	2023-02-17 14:42:47 +01:00
Avi Kivity	e2f6e0b848	utils: move hashing related files to utils/ module Closes #12884	2023-02-17 07:19:52 +02:00
Avi Kivity	69a385fd9d	Introduce schema/ module Schema related files are moved there. This excludes schema files that also interact with mutations, because the mutation module depends on the schema. Those files will have to go into a separate module. Closes #12858	2023-02-15 11:01:50 +02:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Botond Dénes	511c0123a2	Merge 'Add compaction module to task manager' from Aleksandra Martyniuk Introduces task manager's compaction module. That's an initial part of integration of compaction with task manager. When fully integrated, task manager will allow user to track compaction operations, check status and progress of each individual one. It will help with creating an asynchronous version of rest api that forces any compaction. Currently, users can see with /task_manager/list_modules api call that compaction is one of the modules accessible through task manager. They won't get any additional information though, since compaction tasks are not created yet. A shared_ptr to compaction module is kept in compaction manager. Closes #12635 * github.com:scylladb/scylladb: compaction: test: pass task_manager to compaction_manager in test environment compaction: create and register task manager's module for compaction tasks: add task_manager constructor without arguments	2023-02-06 09:25:05 +02:00
Avi Kivity	f73e2c992f	Merge 'Keep range tombstones with rows in memtables and cache' from Tomasz Grabiec This series switches memtable and cache to use a new representation for mutation data, called `mutation_partition_v2`. In this representation, range tombstone information is stored in the same tree as rows, attached to row entries. Each entry has a new tombstone field, which represents range tombstone part which applies to the interval between this entry and the previous one. See docs/dev/mvcc.md for more details about the format. The transient mutation object still uses the old model in order to avoid work needed to adapt old code to the new model. It may also be a good idea to live with two models, since the transient mutation has different requirements and thus different trade-offs can be made. Transient mutation doesn't need to support eviction and strong exception guarantees, so its algorithms and in-memory representation can be simpler. This allows us to incrementally evict range tombstone information. Before this series, range tombstones were accumulated and evicted only when the whole partition entry was evicted. This could lead to inefficient use of cache memory. Another advantage of the new representation is that reads don't have to lookup range tombstone information in a different tree while reading. This leads to simpler and more efficient readers. There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes. Secondly, update algorithms are more complex because they need to deoverlap range tombstone information. Also, to handle preemption and provide strong exception guarantees, update algorithms may need to allocate sentinel entries, which adds complexity and reduces performance. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Remaining work: - performance optimizations to apply_monotonically() to avoid regressions - performance testing - preemption support in apply_to_incomplete (cache update from memtable) Fixes #2578 Fixes #3288 Fixes #10587 Closes #12048 * github.com:scylladb/scylladb: test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction test: mvcc: Extract mvcc_container::allocate_in_region() row_cache, lru: Introduce evict_shallow() test: mvcc: Avoid copies of mutation under failure injection test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic mutation_partition_v2: Avoid full scan when applying mutation to non-evictable Pass is_evictable to apply() tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1 tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint tests: mutation_partition_v2: Add test for exception safety of mutation merging tests: Add tests for the mutation_partition_v2 model mutation_partition_v2: Implement compact() cache_tracker: Extract insert(mutation_partition_v2&) mvcc, mutation_partition: Document guarantees in case merging succeeds mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically() mutation_partition_v2: Simplify get_continuity() row_cache: Distinguish dummy insertion site in trace log db: Use mutation_partition_v2 in mvcc range_tombstone_change_merger: Introduce peek() readers: Extract range_tombstone_change_merger mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots mvcc: partition_snapshot_row_cursor: Support digest calculation mutation_partition_v2: Store range tombstones together with rows db: Introduce mutation_partition_v2 doc: Introduce docs/dev/mvcc.md db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU db: Print range_tombstone bounds as position_in_partition test: memtable_test: Relax test_segment_migration_during_flush test: cache_flat_mutation_reader: Avoid timestamp clash test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows test: mvcc: Fix sporadic failures due to compact_for_compaction() test: lib: random_mutation_generator: Produce partition tombstone less often test: lib: random_utils: Introduce with_probability() test: lib: Improve error message in has_same_continuity() test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker test: mvcc: Insert entries in the tracker test: mvcc_test: Do not set dummy::no on non-clustering rows mutation_partition: Print full position in error report in append_clustered_row() db: mutation_cleaner: Extract make_region_space_guard() position_in_partition: Optimize equality check mvcc: Fix version merging state resetting mutation_partition: apply_resume: Mark operator bool() as explicit	2023-02-05 22:33:10 +02:00

1 2 3 4

161 Commits