scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 12:17:02 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	f2ed9fcd7e	schema_mutations, migration_manager: Ignore empty partitions in per-table digest Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs schema_ptr of exact schema version requested by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if request flow has low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d`, it is more likely to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change introduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in 18f484cc753d17d1e3658bcb5c73ed8f319d32e8. 
Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485.	2023-07-03 23:06:55 +02:00
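The effect of ignoring empty partitions can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the ScyllaDB code: `compacted_partition`, `feed` and `table_digest` are illustrative names, and FNV-1a stands in for whatever hash the real digest uses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a compacted schema-table mutation:
// a partition key plus the cells that survived compaction.
struct compacted_partition {
    std::string key;
    std::vector<std::string> live_cells; // empty once only tombstones were present
};

// FNV-1a, used only as a simple deterministic hash for illustration.
inline void feed(std::uint64_t& h, const std::string& s) {
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
}

// With ignore_empty == true, partitions that are empty after compaction
// contribute nothing, so the digest no longer depends on whether they
// were already compacted away before the query.
inline std::uint64_t table_digest(const std::vector<compacted_partition>& parts,
                                  bool ignore_empty) {
    std::uint64_t h = 14695981039346656037ULL;
    for (const auto& p : parts) {
        if (ignore_empty && p.live_cells.empty()) {
            continue; // skip the partition key of an empty partition
        }
        feed(h, p.key);
        for (const auto& c : p.live_cells) {
            feed(h, c);
        }
    }
    return h;
}
```

With `ignore_empty` set, a partition list that still contains an empty `computed_columns` entry hashes the same as one where that entry has already expired.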
Alejo Sanchez	520bd90008	test/boost/memtable_test: split test plain/reverse Split long running test test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests for _plain and _reverse. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #14447	2023-07-03 15:20:12 +03:00
Piotr Dulikowski	ee9bfb583c	combined: mergers: remove recursion in operator()() In mutation_reader_merger and clustering_order_reader_merger, the operator()() is responsible for producing mutation fragments that will be merged and pushed to the combined reader's buffer. Sometimes, it might have to advance existing readers, open new and / or close some existing ones, which requires calling a helper method and then calling operator()() recursively. In some unlucky circumstances, a stack overflow can occur: - Readers have to be opened incrementally, - Most or all readers must not produce any fragments and need to report end of stream without preemption, - There have to be enough readers opened within the lifetime of the combined reader (~500), - All of the above needs to happen within a single task quota. In order to prevent such a situation, the code of both reader merger classes was modified not to perform recursion at all. Most of the code of the operator()() was moved to maybe_produce_batch which does not recur if it is not possible for it to produce a fragment, instead it returns std::nullopt and operator()() calls this method in a loop via seastar::repeat_until_value. A regression test is added. Fixes: scylladb/scylladb#14415 Closes #14452	2023-06-30 12:07:13 +03:00
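The recursion-to-loop pattern described in the commit above can be sketched synchronously. This is an assumption-laden illustration: `merger_sketch` is a hypothetical class, readers are modelled as plain vectors, and an ordinary `while` loop stands in for `seastar::repeat_until_value`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical merger: each "reader" is a vector of ints; an empty one
// reports end of stream immediately without producing anything.
class merger_sketch {
    std::vector<std::vector<int>> _readers;
    std::size_t _next = 0;
public:
    explicit merger_sketch(std::vector<std::vector<int>> readers)
        : _readers(std::move(readers)) {}

    bool at_end() const { return _next == _readers.size(); }

    // One step: open the next reader and return its batch, or std::nullopt
    // if it produced nothing. It never calls itself or operator()().
    std::optional<std::vector<int>> maybe_produce_batch() {
        auto batch = std::move(_readers[_next++]);
        if (batch.empty()) {
            return std::nullopt;
        }
        return batch;
    }

    // Loops until a batch is produced or all readers are exhausted,
    // so stack depth stays constant however many readers are empty.
    std::vector<int> operator()() {
        while (!at_end()) {
            if (auto b = maybe_produce_batch()) {
                return std::move(*b);
            }
        }
        return {};
    }
};
```

A long run of empty readers followed by one non-empty reader is handled in a single call with no nested frames.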
Nadav Har'El	dd63169077	Merge 'test/boost/index_with_paging_test: reduce running time' from Alecco Reduce test string value size, parallelize inserts, and use a prepared statement. The debug running time for this test is reduced from 13:18 to 7:52. Refs #13905 Closes #14380 * github.com:scylladb/scylladb: test/boost/index_with_paging_test: parallel insert test/boost/index_with_paging_test: prepared statement test/boost/index_with_paging_test: reduce running time	2023-06-29 10:45:01 +03:00
Kamil Braun	b912eeade5	Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb Since most group0 commands are just mutations it is easy to combine them before passing them to the subsystem they are destined for, which is more efficient. The logic that handles those mutations in a subsystem will run once for each batch of commands instead of for each individual command. This is especially useful when a node catches up to a leader and gets a lot of commands together. The patch here does exactly that. It combines commands into a single command if possible, but it preserves the order between commands, so each time it encounters a command to a different subsystem it flushes the already combined batch and starts a new one. This extra safety assumes that there are dependencies between subsystems managed by group0, so the order matters. It may not be the case now, but we prefer to be on the safe side. Broadcast table commands are not mutations, so they are never combined. * 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla: test: add test for group0 raft command merging service: raft: respect max mutation size limit when persisting raft entries group0_state_machine: merge commands before applying them whenever possible	2023-06-28 17:21:07 +02:00
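The order-preserving merge described above can be sketched as follows. The `command` struct, the `mergeable` flag and the subsystem names used in the usage note are hypothetical; the real group0 command types are not shown in this log.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical command: a target subsystem plus the mutations it carries.
// Commands that are not plain mutations are marked mergeable == false.
struct command {
    std::string subsystem;
    std::vector<std::string> mutations;
    bool mergeable = true;
};

// Merges adjacent commands for the same subsystem. A command for a
// different subsystem, or a non-mergeable one, closes the current batch,
// so the relative order between subsystems is preserved.
inline std::vector<command> merge_commands(const std::vector<command>& in) {
    std::vector<command> out;
    for (const auto& c : in) {
        const bool can_merge = !out.empty()
            && c.mergeable
            && out.back().mergeable
            && out.back().subsystem == c.subsystem;
        if (can_merge) {
            auto& dst = out.back().mutations;
            dst.insert(dst.end(), c.mutations.begin(), c.mutations.end());
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

For an input of two commands for one subsystem, one for another, and then one more for the first, the result is three batches rather than two, because merging never reorders commands across a subsystem change.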
Alejo Sanchez	d4697ed21e	test/boost/index_with_paging_test: parallel insert Parallelize inserts for long-running test_index_with_paging. Run time in debug mode reduced by 1 minute 48 seconds. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-06-28 16:11:58 +02:00
Alejo Sanchez	70a3179888	test/boost/index_with_paging_test: prepared statement Prepare statement for insert. Run time in debug mode reduced by 9 seconds. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-06-28 14:49:21 +02:00
Alejo Sanchez	48d24269f1	test/boost/index_with_paging_test: reduce running time Reduce test string value size for test_index_with_paging from 4096 to 100. With 100 bytes it should make the base row significantly larger than the key so the test will exercise both types of paging in the scanning code. The debug running time for this test is reduced from 9 minutes to 6 minutes. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-06-28 13:55:52 +02:00
Kamil Braun	96bc78905d	readers: evictable_reader: don't accidentally consume the entire partition The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between. The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction. So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491. There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order. Fix the comparison and adjust one of the tests (added in #13563) to detect this case. Fixes #13491	2023-06-27 14:37:29 +02:00
Kamil Braun	5800ce8ddd	test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position test_range_tombstones_v2 is too strict for this reader -- it expects a particular sequence of `range_tombstone_change`s, but multishard_combining_reader, when tested with a small buffer, may generate -- as expected -- additional (redundant) range tombstone change pairs (end+start). Currently we don't observe these redundant fragments due to a bug in `evictable_reader_v2` but they start appearing once we fix the bug and the test must be prepared first. To prepare the test, modify `flat_reader_assertions_v2` so it squashes redundant range tombstone change pairs. This happens only in non-exact mode. Enable exact mode in `test_sstable_reversing_reader_random_schema` for comparing two readers -- the squashing of `r_t_c`s may introduce an artificial difference.	2023-06-27 14:37:25 +02:00
Gleb Natapov	945f476363	test: add test for group0 raft command merging Add a test that submits 3 large commands each one a little bit larger than 1/3 of maximum mutation size. Check that in the end 2 commands were executed (first 2 were merged and third was executed separately).	2023-06-27 14:59:55 +03:00
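The arithmetic behind this test can be checked with a small sketch, assuming a greedy merge that closes a batch as soon as adding the next command would exceed the size limit. `batch_sizes` and the concrete numbers are hypothetical; the real limit and merging code are not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedily packs command sizes into batches whose total stays within
// max_size, preserving order. Returns the total size of each batch.
inline std::vector<std::size_t> batch_sizes(const std::vector<std::size_t>& cmd_sizes,
                                            std::size_t max_size) {
    std::vector<std::size_t> out;
    for (std::size_t s : cmd_sizes) {
        if (!out.empty() && out.back() + s <= max_size) {
            out.back() += s;  // still fits: merge into the current batch
        } else {
            out.push_back(s); // would exceed the limit: start a new batch
        }
    }
    return out;
}
```

With a limit of 300 and three commands of size 101 (each slightly more than a third of the limit), the first two merge into a batch of 202 and the third is left on its own, giving two executed commands.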
Botond Dénes	f5e3b8df6d	Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: ``` + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert ``` That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from `INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s` to `INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s` Refs https://github.com/scylladb/scylladb/issues/14089. Fixes scylladb/scylladb#14244. Closes #14364 * github.com:scylladb/scylladb: table: Optimize creation of reader excluding staging for view building view_update_generator: Dump throughput and duration for view update from staging utils: Extract pretty printers into a header	2023-06-27 07:25:30 +03:00
Raphael S. Carvalho	1d8cb32a5d	table: Optimize creation of reader excluding staging for view building View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s to INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s Refs #14089. Fixes #14244. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-06-26 22:30:39 -03:00
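The idea of filtering with a predicate instead of rebuilding the set can be sketched as below. `sstable_sketch`, `sstable_predicate` and `readable_ids` are hypothetical names and do not reflect the actual sstable set interface.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sstable descriptor; only the fields needed for the sketch.
struct sstable_sketch {
    int id;
    bool staging;
};

using sstable_predicate = std::function<bool(const sstable_sketch&)>;

// Selects the sstables a reader should cover by applying a predicate to
// the existing set at reader-creation time. No new set (and so no new
// interval map) is built per partition.
inline std::vector<int> readable_ids(const std::vector<sstable_sketch>& existing_set,
                                     const sstable_predicate& pred) {
    std::vector<int> ids;
    for (const auto& sst : existing_set) {
        if (pred(sst)) {
            ids.push_back(sst.id);
        }
    }
    return ids;
}
```

Passing a predicate such as `[](const sstable_sketch& s) { return !s.staging; }` excludes staging sstables while leaving the shared set untouched. Because the predicate inspects each sstable's own state, it also excludes staging sstables that were not part of the current batch.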
Raphael S. Carvalho	83c70ac04f	utils: Extract pretty printers into a header Can be easily reused elsewhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-06-26 21:58:20 -03:00
Alexey Novikov	ca4e7f91c6	compact and remove expired rows from cache on read: when reading from cache, compact and expire row tombstones; remove expired empty rows from cache; do not expire range tombstones in this patch. Refs #2252, #6033 Closes #12917	2023-06-26 15:29:01 +02:00
Alejo Sanchez	4999cbc1cf	test/boost/cql_functions_test: split long running tests Split long running test_aggregate_functions to one case per type. This allows test.py to run them in parallel. Before this it would take 18 minutes to run in debug mode. Afterwards each case takes 30-45 seconds. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #14368	2023-06-26 11:29:36 +03:00
Alejo Sanchez	8b1968cfbb	test/boost/schema_changes_test: split long-running test Split long running test test_schema_changes in 3 parts, one for each writable_sstable_versions so it can be run in parallel by test.py. Add static checks to alert if the array of types changed. Original test takes around 24 minutes in debug mode, and each new split test takes around 8 minutes. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #14367	2023-06-26 11:24:07 +03:00
Alejo Sanchez	633f026d63	test/boost/memtable_test: allow parallel run Remove previous configuration blocking parallel run. Test cases run fine in local debug. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #14369	2023-06-26 11:23:43 +03:00
Alejo Sanchez	3cbfd863eb	test/boost/database_test: split long running tests Split long running tests test_database_with_data_in_sstables_is_a_mutation_source_plain and test_database_with_data_in_sstables_is_a_mutation_source_reverse. They run with x_log2_compaction_groups of 0 and 1, each run taking 10 to 15 minutes in debug mode, for a total of 28 and 22 minutes. Split the test cases to run with 0 and 1, so test.py can run them in parallel. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #14356	2023-06-26 11:20:27 +03:00
Avi Kivity	b858a4669d	cql3: expr: break up expression.hh header Adding a function declaration to expression.hh causes many recompilations. Reduce that by: - moving some restrictions-related definitions to the existing expr/restrictions.hh - moving evaluation related names to a new header expr/evaluate.hh - move utilities to a new header expr/expr-utilities.hh expression.hh contains only expression definitions and the most basic and common helpers, like printing.	2023-06-22 14:21:03 +03:00
Kefu Chai	f014ccf369	Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"" This reverts commit `562087beff`. The regressions introduced by the reverted change have been fixed. So let's revert this revert to resurrect the uuid_sstable_identifier_enabled support. Fixes #10459	2023-06-21 13:02:40 +03:00
Avi Kivity	e233f471b8	Merge 'Respect tablet shard assignment' from Tomasz Grabiec This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets): 1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored. 2. CDC subsystem was not adjusted (not supported yet) 3. sstable sharding metadata reflects tablet boundaries 5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables) 6. The system is NOT prepared to handle tablet migration / topology changes in a safe way. 7. Sstable cleanup is not wired properly yet After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead. To make the life easier, support was added to obtain table pointer from the schema pointer: ``` schema_ptr s; s->table().shard_of(...) 
``` Closes #13939 * github.com:scylladb/scylladb: locator: network_topology_startegy: Allocate shards to tablets locator: Store node shard count in topology service: topology: Extract topology updating to a lambda test: Move test_tablets under topology_experimental sstables: Add trace-level logging related to shard calculation schema: Catch incorrect uses of schema::get_sharder() dht: Rename dht::shard_of() to dht::static_shard_of() treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of() storage_proxy: Avoid multishard reader for tablets storage_proxy: Obtain shard from erm in the read path db, storage_proxy: Drop mutation/frozen_mutation ::shard_of() forward_service: Use table sharder alternator: Use table sharder db: multishard: Obtain sharder from erm sstable_directory: Improve trace-level logging db: table: Introduce shard_of() helper db: Use table sharder in compaction sstables: Compute sstable shards using sharder from erm when loading sstables: Generate sharding metadata using sharder from erm when writing test: partitioner: Test split_range_to_single_shard() on tablet-like sharder dht: Make split_range_to_single_shard() prepared for tablet sharder sstables: Move compute_shards_for_this_sstable() to load() dht: Take sharder externally in splitting functions locator: Make sharder accessible through effective_replication_map dht: sharder: Document guarantees about mapping stability tablets: Implement tablet sharder tablets: Include pending replica in get_shard() dht: sharder: Introduce next_shard() db: token_ring_table: Filter out tablet-based keyspaces db: schema: Attach table pointer to schema schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load() schema_registry: Make learn(schema_ptr) attach entry to the target schema test: lib: cql_test_env: Expose feature_service test: Extract throttle object to separate header	2023-06-21 10:20:41 +03:00
Calle Wilund	f18e967939	storage_proxy: Make split_stats resilient to being called from different scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14294	2023-06-21 10:08:27 +03:00
Tomasz Grabiec	ebdebb982b	locator: network_topology_startegy: Allocate shards to tablets Uses a simple algorithm for allocating shards which chooses least-loaded shard on a given node, encapsulated in load_sketch. Takes load due to current tablet allocation into account. Each tablet, new or allocated for other tables, is assumed to have an equal load weight.	2023-06-21 00:58:25 +02:00
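A least-loaded shard allocator of the kind described above can be sketched in a few lines. `shard_load_example` is a hypothetical class for one node and is not the `load_sketch` interface from the commit; every tablet is given weight 1, as the message states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-node load tracker: one counter per shard, seeded with
// the number of tablets already allocated to that shard.
class shard_load_example {
    std::vector<std::size_t> _load;
public:
    // Precondition: current_load has at least one shard.
    explicit shard_load_example(std::vector<std::size_t> current_load)
        : _load(std::move(current_load)) {}

    // Picks the least-loaded shard (lowest index on ties) and accounts
    // for the newly placed tablet with weight 1.
    std::size_t next_shard() {
        auto it = std::min_element(_load.begin(), _load.end());
        ++*it;
        return static_cast<std::size_t>(it - _load.begin());
    }
};
```

Starting from loads `{2, 0, 1}`, successive calls return shards 1, 1, 2 and then 0, at which point all shards hold three tablets.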
Tomasz Grabiec	29cbdb812b	dht: Rename dht::shard_of() to dht::static_shard_of() This is in order to prevent new incorrect uses of dht::shard_of() from being accidentally added. Also, makes sure that all current uses are caught by the compiler and require an explicit rename.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	21198e8470	treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of() dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	e48ec6fed3	db, storage_proxy: Drop mutation/frozen_mutation ::shard_of() dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	d92287f997	db: multishard: Obtain sharder from erm This is not strictly necessary, as the multishard reader will be later avoided altogether for tablet-based tables, but it is a step towards converting all code to use the erm->get_sharder() instead of schema::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	36da062bcb	db: Use table sharder in compaction	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	ad983ac23d	sstables: Compute sstable shards using sharder from erm when loading schema::get_sharder() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should obtain the sharder from erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	36e12020b9	test: partitioner: Test split_range_to_single_shard() on tablet-like sharder	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	28b972a588	dht: Make split_range_to_single_shard() prepared for tablet sharder The function currently assumes that shard assignment for subsequent tokens is round robin, which will not be the case for tablets. This can lead to incorrect split calculation or infinite loop. Another assumption was that subsequent splits returned by the sharder have distinct shards. This also doesn't hold for tablets, which may return the same shard for subsequent tokens. This assumption was embedded in the following line: start_token = sharder.token_for_next_shard(end_token, shard); If the range which starts with end_token is also owned by "shard", token_for_next_shard() would skip over it.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	390bcf3fae	dht: Take sharder externally in splitting functions We need those functions to work with tablet sharder, which is not accessible through schema::get_sharder(). In order to propagate the right sharder, those functions need to take it externally rather than from the schema object. The sharder will come from the effective_replication_map attached to the table object. Those splitting functions are used when generating sharding metadata of an sstable. We need to keep this sharding metadata consistent with tablet mapping to shards in order for node restart to detect that those sstables belong to a single shard and that resharding is not necessary. Resharding of sstables based on tablet metadata is not implemented yet and will abort after this series. Keeping sharding metadata accurate for tablets is only necessary until compaction group integration is finished. After that, we can use the sstable token range to determine the owning tablet and thus the owning shard. Before that, we can't, because a single sstable may contain keys from different tablets, and the whole key range may overlap with keys which belong to other shards.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	22ab100b41	tablets: Implement tablet sharder	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	e44e6033d8	tablets: Include pending replica in get_shard() We need to move get_shard() from tablet_info to tablet_map in order to have access to transition_info.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	2303466375	db: schema: Attach table pointer to schema This will make it easier to access table properties in places which only have schema_ptr. This is in particular useful when replacing dht::shard_of() uses with s->table().shard_of(), now that sharding is no longer static, but table-specific. Also, it allows us to install a guard which catches invalid uses of schema::get_sharder() on tablet-based tables. It will be helpful for other uses as well. For example, we can now get rid of the static_props hack.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	ad6d2b42f2	test: Extract throttle object to separate header	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	87b4606cd6	Merge 'atomic_cell: compare value last' from Benny Halevy Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in `e225c88a65` ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is threefold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from either upsert, as all cells in the respective upsert will carry the same expiration time. Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability	2023-06-20 12:11:48 +02:00
Benny Halevy	761d62cd82	compare_atomic_cell_for_merge: compare value last for live cells Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was changed in CASSANDRA-14592 for consistency with the preference for dead cells over live cells, as expiring cells will become tombstones at a future time and then they'd win over live cells with the same timestamp, hence they should win also before expiration. In addition, comparing the cell value before expiration can lead to unintuitive corner cases where rewriting a cell using the same timestamp but different TTL may cause scylla to return the cell with null value if it expired in the meanwhile. Also, when multiple columns are written using two upserts using the same write timestamp but with different expiration, selecting cells by their value may return a mixed result where each cell is selected individually from either upsert, by picking the cells with the largest values for each column, while using the expiration time to break tie will lead to a more consistent results where a set of cell from only one of the upserts will be selected. Fixes scylladb/scylladb#14182 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-06-20 10:10:39 +03:00
Benny Halevy	ec034b92c0	mutation_test: test_cell_ordering: improve debuggability Currently, it is hard to tell which of the many sub-cases fail in this unit test, in case any of them fails. This change uses logging in debug and trace level to help with that by reproducing the error with --logger-log-level testlog=trace (The cases are deterministic so reproducing should not be a problem) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-06-20 10:10:39 +03:00
Tomasz Grabiec	5fa08adc88	Merge 'cache_flat_mutation_reader: use the correct schema in prepare_hash' from Michał Chojnowski Since `mvcc: make schema upgrades gentle` (`51e3b9321b`), rows pointed to by the cursor can have different (older) schema than the schema of the cursor's snapshot. However, one place in the code wasn't updated accordingly, causing a row to be processed with the wrong schema in the right circumstances. This passed through unit testing because it requires a digest-computing cache read after a schema change, and no test exercised this. This series fixes the bug and adds a unit test which reproduces the issue. Fixes #14110 Closes #14305 * github.com:scylladb/scylladb: test: boost/row_cache_test: add a reproducer for #14110 cache_flat_mutation_reader: use the correct schema in prepare_hash mutation: mutation_cleaner: add pause()	2023-06-20 01:30:11 +02:00
Michał Chojnowski	02bcb5d539	test: boost/row_cache_test: add a reproducer for #14110	2023-06-19 22:50:46 +02:00
Botond Dénes	bd7a3e5871	Merge 'Sanitize sstables-making utils in tests' from Pavel Emelyanov There are tons of wrappers that help test cases make sstables for their needs. And lots of code duplication in test cases that do parts of those helpers' work on their own. This set cleans some bits of those Closes #14280 * github.com:scylladb/scylladb: test/utils: Generalize making memtable from vector<mutation> test/util: Generalize make_sstable_easy()-s test/sstable_mutation: Remove useless helper test/sstable_mutation: Make writer config in make_sstable_mutation_source() test/utils: De-duplicate make_sstable_containing-s test/sstable_compaction: Remove useless one-line local lambda test/sstable_compaction: Simplify sstable making test/sstables*: Make sstable from vector of mutations test/mutation_reader: Remove create_sstable() helper from test	2023-06-19 14:05:29 +03:00
Pavel Emelyanov	6bec03f96f	test: Remove sstable_utils' storage_prefix() helper It's excessive, test case that needs it can get storage prefix without this fancy wrapper-helper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14273	2023-06-19 13:51:04 +03:00
Pavel Emelyanov	1a332ef5e2	test: Check sstable bytes correctness on S3 too Commit `4e205650` (test: Verify correctness of sstable::bytes_on_disk()) added a test to verify that sstable::bytes_on_disk() is equal to the real size of real files. The same test case makes sense for S3-backed sstables as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14272	2023-06-19 13:47:31 +03:00
Nadav Har'El	ac3d0d4460	Merge 'cql3: expr: support evaluate(column_mutation_attribute)' from Avi Kivity In preparation for converting selectors to evaluate expressions, add support for evaluating column_mutation_attribute (representing the WRITETIME/TTL pseudo-functions). A unit test is added. Fixes #12906 Closes #14287 * github.com:scylladb/scylladb: test: expr: test evaluation of column_mutation_attribute test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps cql3: expr: evaluate() column_mutation_attribute	2023-06-19 11:11:49 +03:00
Botond Dénes	562087beff	Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai" This reverts commit `d1dc579062`, reversing changes made to `3a73048bc9`. Said commit caused regressions in dtests. We need to investigate and fix those, but in the meanwhile let's revert this to reduce the disruption to our workflows. Refs: #14283	2023-06-19 08:49:27 +03:00
Avi Kivity	0f98e9f8c8	test: expr: test evaluation of column_mutation_attribute There's no way to evaluate a column_mutation_attribute via CQL yet (the only user uses old-style cql3::selection::selector), so we only supply a unit test.	2023-06-18 22:47:46 +03:00
Nadav Har'El	97d444bbf7	Merge 'cql3/expression: implement evaluate(field_selection) ' from Jan Ciołek Implement `expr::evaluate()` for `expr::field_selection`. `field_selection` is used to represent access to a struct field. For example, with a UDT value: ``` CREATE TYPE my_type (a int, b int); ``` The expression `my_type_value.a` would be represented as a `field_selection`, which selects the field `a`. Evaluating such an expression consists of finding the right element's value in a serialized UDT value and returning it. Note that it's still not possible to use `field_selection` inside the `WHERE` clause. Enabling it would require changes to the grammar, as well as query planning. Current `statement_restrictions` just reacts with `on_internal_error` when it encounters a `field_selection`. Nonetheless it's a step towards relaxing the grammar, and now it's finally possible to evaluate all kinds of prepared expressions (#12906) Fixes: https://github.com/scylladb/scylladb/issues/12906 Closes #14235 * github.com:scylladb/scylladb: boost/expr_test: test evaluate(field_selection) cql3/expr: fix printing of field_selection cql3/expression: implement evaluate(field_selection) types/user: modify idx_of_field to use bytes_view column_identifer: add column_identifier_raw::text() types: add read_nth_user_type_field() types: add read_nth_tuple_element()	2023-06-18 11:08:25 +03:00
Pavel Emelyanov	85310bc043	test/sstable_mutation: Remove useless helper There are two make_sstable_mutation_source() helpers that call one another and test cases only need one of them, so leave just one that's in use. Also don't pass env's tempdir to make_sstable() util call, it can get env's tempdir on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-16 21:21:40 +03:00
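
The merge ordering described in commit `761d62cd82` (compare value last for live cells) can be sketched as a standalone comparator. This is a simplified illustration, not ScyllaDB's actual `compare_atomic_cell_for_merge`: the `cell` struct, the `three_way` helper, and the name `compare_cell_for_merge` are hypothetical, ordering between two dead cells is left out, and the direction of the TTL tie-break (at equal expiry, a smaller TTL means the cell was written later) is an assumption drawn from the commit text.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for an atomic cell.
struct cell {
    std::int64_t timestamp; // write timestamp
    bool live;              // false means the cell is a tombstone
    bool expiring;          // live cell that carries a TTL
    std::int64_t expiry;    // expiration time, meaningful only if expiring
    std::int64_t ttl;       // time-to-live, meaningful only if expiring
    std::string value;      // serialized value
};

// Returns -1, 0 or 1 for a < b, a == b, a > b.
template <typename T>
int three_way(const T& a, const T& b) {
    return (a > b) - (a < b);
}

// A positive result means `l` wins the merge, a negative one means `r` wins.
int compare_cell_for_merge(const cell& l, const cell& r) {
    // 1. The higher write timestamp always wins.
    if (l.timestamp != r.timestamp) {
        return three_way(l.timestamp, r.timestamp);
    }
    // 2. Dead cells are preferred over live cells.
    if (l.live != r.live) {
        return l.live ? -1 : 1;
    }
    if (!l.live) {
        return 0; // dead-vs-dead ordering is outside this sketch
    }
    // 3. Expiring cells will become tombstones, so they beat plain live cells.
    if (l.expiring != r.expiring) {
        return l.expiring ? 1 : -1;
    }
    if (l.expiring) {
        // 4. The cell that expires later wins.
        if (l.expiry != r.expiry) {
            return three_way(l.expiry, r.expiry);
        }
        // 5. Assumed: same expiry, smaller TTL means written later, so it wins.
        if (l.ttl != r.ttl) {
            return three_way(r.ttl, l.ttl);
        }
    }
    // 6. Only now does the value break the tie.
    return three_way(l.value.compare(r.value), 0);
}
```

Because the value is consulted last, two upserts that share a timestamp but differ in expiration resolve every column the same way, which is the consistency argument the commit message makes.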


2632 Commits