scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 06:05:53 +00:00

Author	SHA1	Message	Date
Botond Dénes	dfd7f03463	tree: s/make_generating_reader_v2/make_generating_reader/ Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	c29c696780	readers: mv from_mutations_v2.hh from_mutations.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	b104862702	tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s Completely mechanical change.	2025-04-16 04:46:07 -04:00
Botond Dénes	7547d0c6a9	readers: mv from_fragments_v2.hh from_fragments.hh Completely mechanical change.	2025-04-16 04:35:00 -04:00
Botond Dénes	f1bd2553ed	readers: mv forwardable_v2.hh forwardable.hh Completely mechanical change.	2025-04-16 04:33:50 -04:00
Botond Dénes	a9d75c4f9d	readers: mv empty_v2.hh empty.hh Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	05829f98f3	tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	7d9b91a00e	readers: mv delegating_v2.hh delegating.hh Completely mechanical change.	2025-04-16 04:11:55 -04:00
Botond Dénes	c7f68a2649	readers/delegating_v2.hh: move reader definition to _impl.hh file The idea behind readers/ is that each reader has its minimal header with just a factory method declaration. The delegating reader is defined in the factory header because it has a derived class in row_cache_test.cc. Move the definition to delegating_impl.hh so users not interested in deriving from it don't pay the price in header include cost.	2025-04-16 03:47:57 -04:00
Michał Chojnowski	b3d951517d	test/scylla_gdb: generate a coredump when coro_task fails This test fails sometimes, but rarely and unreliably. We want to get a coredump from it the next time it fails. Sending a SIGSEGV should induce that. Refs https://github.com/scylladb/scylladb/issues/22501 Closes scylladb/scylladb#23256	2025-04-15 15:16:38 +03:00
Calle Wilund	abd2d8a58b	test_tools: Manual merge of local key gen tool test from enterprise Fixes scylladb/scylla-enterprise#5358 Transposed tool test for local file generator, originally java test. Then enterprise test. Now here. Closes scylladb/scylladb#23726	2025-04-15 15:14:08 +03:00
Piotr Dulikowski	22e3b8eccd	Merge 'test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek In this PR, we adjust tests in the cqlpy test suite so they only use RF-rack-valid keyspaces. After that, we enable the configuration option `rf_rack_valid_keyspaces` in the suite by default. Refs scylladb/scylladb#23428 Backport: backporting to 2025.1 so we can test the option there too. Closes scylladb/scylladb#23489 * github.com:scylladb/scylladb: test/cqlpy: Enable rf_rack_valid_keyspaces by default test: Move test_alter_tablet_keyspace_rf to cluster suite test/cqlpy: Adjust tests to RF-rack-valid keyspaces test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces	2025-04-15 12:43:11 +02:00
Nadav Har'El	fbcf77d134	raft: make group0 Raft operation timeout configurable A recent commit `370707b111` (re)introduced a timeout for every group0 Raft operation. This timeout was set to 60 seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody". However, one of the things we do as a group0 operation is schema changes, and we already noticed a few years ago, see commit `0b2cf21932`, that in some extremely overloaded test machines where tests run hundreds of times (!) slower than usual, a single big schema operation - such as Alternator's DeleteTable deleting a table and multiple of its CDC or view tables - sometimes takes more than 60 seconds. The above fix changed the client's timeout to wait for 300 seconds instead of 60 seconds, but now we also need to increase our Raft timeout, or the server can time out. We've seen this happening recently making some tests flaky in CI (issue #23543). So let's make this timeout configurable, as a new configuration option group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e., 60 seconds), the same as the existing default. The test framework overrides this default with a higher 300 second timeout, matching the client-side timeout. Before this patch, this timeout was already configurable in a strange way, using injections. But this was a misstep: We already have more than a dozen timeouts configurable through the normal configuration, and this one should have been configured in the same way. There is nothing "holy" about the default of 60 seconds we chose, and who knows maybe in the future we might need to tweak it in the field, just like we made the other timeouts tweakable. Injections cannot be used in release mode, but configuration options can. Fixes #23543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23717	2025-04-15 10:57:39 +03:00
Pavel Emelyanov	b25cb5af0c	Merge 'Use named gates' from Benny Halevy Name the gates and phased barriers we use to make it easy to debug gate_closed_exception Refs https://github.com/scylladb/seastar/pull/2688 * Enhancement only, no backport needed Closes scylladb/scylladb#23329 * github.com:scylladb/scylladb: utils: loading_cache: use named_gate utils: flush_queue: use named_gate sstables_manager: use named gate sstables_loader: use named gate utils: phased_barrier, pluggable: use named gate utils: s3::client::multipart_upload: use named gate utils: s3::client: use named_gate transport: controller: use named gate tracing: trace_keyspace_helper: use named gate task_manager: module: use named gate topology_coordinator: use named gate storage_service: use named gate storage_proxy: wait_for_hint_sync_point: use named gate storage_proxy: remote: use named gate service: session: use named gate service: raft: raft_rpc: use named gate service: raft: raft_group0: use named gate service: raft: persistent_discovery: use named gate service: raft: group0_state_machine: use named gate service: migration_manager: use named gate replica: table: use named gate replica: compaction_group, storage_group: use named gate redis: query_processor: use named gate repair: repair_meta: use named gate reader_concurrency_semaphore: use named gate raft: server_impl: use named gate querier_cache: use named gate gms: gossiper: use named gate generic_server: use named gate db: sstables_format_listener: use named gate db: snapshot: backup_task: use named gate db: snapshot_ctl: use named gate hints: hints_sender: use named gate hints: manager: use named gate hints: hint_endpoint_manager: use named gate commitlog: segment_manager: use named gate db: batchlog_manager: use named gate query_processor: remote: use named gate compaction: compaction_state: use named gate alternator/server: use named_gate	2025-04-14 20:56:32 +03:00
Pavel Emelyanov	1bd991a111	test: Inherit sstable_assertions from sstables::test The latter class was invented to let tests access private fields of an sstable (mostly methods). The former is in fact an extended version of it that also does some checks. However, they don't inherit from each other, and sstable_assertions partially duplicates some functionality of the test one. Add the inheritance, remove the duplicated methods from the child class, update the callers (the test class returns future<>s, the assertions one "knows" it runs in a seastar thread) and mark sstable::read_toc() private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23697	2025-04-14 13:45:14 +03:00
Andrei Chekun	8e33d7ab81	test.py: Make the testpy log files in pytest follow the same format Fix the incorrect log file names between conftest and scylla_manager. This regression was introduced in #22960. Currently, scylla manager writes its logs to a file with the following pattern: suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log At the same time, pytest tries to find this log under the following name: suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log Because of this inconsistency, when a test fails, the scylla manager log file is not copied to the failed_test directory and the test raises an exception on teardown. Closes scylladb/scylladb#23596	2025-04-14 12:52:48 +03:00
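The mismatch between the two name patterns above can be illustrated with a small sketch. The function names, the example suite and test path, and the way the sub-folder path is flattened into dots are all assumptions made for illustration; only the two overall patterns come from the commit message.

```python
from pathlib import PurePosixPath


def manager_style_name(suite, test_file, run_id, function, mode, cluster_run_id):
    # Hypothetical encoding: keep the sub-folders, drop ".py", join parts with dots.
    stem = PurePosixPath(test_file).with_suffix("")
    dotted_path = ".".join(stem.parts)
    return f"{suite}.{dotted_path}.{run_id}.{function}.{mode}.{cluster_run_id}.log"


def pytest_style_name(suite, test_file, run_id, function, mode, cluster_run_id):
    # Hypothetical encoding: bare file name only, ".py" suffix kept.
    file_name = PurePosixPath(test_file).name
    return f"{suite}.{file_name}.{run_id}.{function}.{mode}.{cluster_run_id}.log"


args = ("cluster", "mv/test_example.py", 1, "test_case", "dev", 1)
written = manager_style_name(*args)
looked_up = pytest_style_name(*args)
# The two sides disagree, so a lookup by the pytest-style name finds nothing.
print(written)
print(looked_up)
print(written == looked_up)
```

Whatever the exact encoding is, the point is that both sides need to derive the name from one shared helper so they cannot drift apart again.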
Evgeniy Naydanov	d6b64642c5	test.py: print out path to Scylla log for Python test suites Test suites with `type: Python` are using single Scylla node created by test.py, but it's handy to print a path to a log file in pytest log too to make it easier to find the file on failures. Closes scylladb/scylladb#23683	2025-04-14 11:15:37 +03:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	bdd5a61139	commitlog: segment_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Dawid Mędrek	be0877ce69	test/cqlpy: Enable rf_rack_valid_keyspaces by default All of the tests in the suite have been adjusted so they only use RF-rack-valid keyspaces, so let's start enabling the option by default.	2025-04-11 14:55:13 +02:00
Dawid Mędrek	a59842257a	test: Move test_alter_tablet_keyspace_rf to cluster suite We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the cluster test suite. The reason behind the change is that the test cannot be run with `rf_rack_valid_keyspaces` turned on in the configuration. During the test, we make the keyspace RF-rack-invalid multiple times. Since RF-rack-validity is a very strong constraint, adjusting the test otherwise is impossible. By moving it to the cluster test suite, we're able to change the configuration of the node used in the test, and so the test can work again.	2025-04-11 14:55:11 +02:00
Dawid Mędrek	958eaec056	test/cqlpy: Adjust tests to RF-rack-valid keyspaces	2025-04-11 14:55:04 +02:00
Dawid Mędrek	6bde01bb59	test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces We adjust three existing Cassandra tests so that they don't create RF-rack-invalid keyspaces. We modify the replication factor used in the problematic tests. The changes don't affect the tests as the value of the RF is unrelated to what they verify. Thanks to that, we can run them now even with enforced RF-rack-valid keyspaces. The drawback is that the modified ALTER statements do not modify the RF at all. However, since the tests seem to verify that the code responsible for VALIDATING a request works as intended, that should have little to no impact on them.	2025-04-11 14:20:14 +02:00
Dawid Mędrek	10589e966f	test/cluster/mv: Adjust test to RF-rack-valid keyspaces We adjust the test in the directory so that all of the used keyspaces are RF-rack-valid throughout their execution. Refs scylladb/scylladb#23428 Closes scylladb/scylladb#23490	2025-04-11 14:03:21 +02:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. 
Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
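The quorum arithmetic behind the limited voters description can be checked with a short sketch. The helper names are hypothetical and this is plain majority arithmetic, not the topology coordinator's actual selection logic.

```python
def majority(voters: int) -> int:
    # Smallest number of voters that forms a strict majority.
    return voters // 2 + 1


def keeps_majority(total_voters: int, lost_voters: int) -> bool:
    # True if the surviving voters can still form a quorum.
    return total_voters - lost_voters >= majority(total_voters)


# All nodes are voters: 2 DCs with 5 nodes plus 1 DC with 10 nodes (20 voters).
# Isolating the large DC leaves 10 voters, but a majority needs 11.
print(keeps_majority(20, 10))

# With 5 voters, losing 2 is tolerated; 3 must die to lose the majority.
print(keeps_majority(5, 2), keeps_majority(5, 3))

# With fewer than 5 voters, losing 2 already loses the majority.
print(keeps_majority(4, 2), keeps_majority(3, 2))

# 2 DCs with 2 nodes each: a 2+1 split survives the loss of the smaller DC,
# while a symmetric 2+2 split loses the majority when either DC is lost.
print(keeps_majority(3, 1), keeps_majority(4, 2))
```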
Avi Kivity	9559e53f55	Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph. To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer. Closes scylladb/scylladb#23584 * github.com:scylladb/scylladb: tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization tablet-mon.py: Center tablet id text properly in the vertical axis tablet-mon.py: Show migration stage tag in table mode only when migrating virtual-tables: Introduce system.load_per_node virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() docs: virtual-tables: Fix instructions service: tablets: Keep load_stats inside tablet_allocator	2025-04-10 14:59:08 +03:00
Pavel Emelyanov	88318d3b50	topology_coordinator: Use shorter fault-injection overloads There are a few places that want to pause until a message is received from the test. There's a convenient one-line sugar to do it. One test needs to update its expectations about the log message that appears when scylla steps on it and actually starts waiting. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23390	2025-04-10 14:05:46 +03:00
Botond Dénes	d67202972a	mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling This adaptor adapts a mutation reader pausable consumer to the frozen mutation visitor interface. The pausable consumer protocol allows the consumer to skip the remaining parts of the partition and resume the consumption with the next one. To do this, the consumer just has to return stop_iteration::yes from one of the consume() overloads for clustering elements, then return stop_iteration::no from consume_end_of_partition(). Due to a bug in the adaptor, this sequence leads to terminating the consumption completely -- so any remaining partitions are also skipped. This protocol implementation bug has user-visible effects, when the only user of the adaptor -- read repair -- happens during a query which has limitations on the amount of content in each partition. There are two such queries: select distinct ... and select ... with partition limit. When converting the repaired mutation to a query result, these queries will trigger the skip sequence in the consumer and due to the above described bug, will skip the remaining partitions in the results, omitting these from the final query result. This patch fixes the protocol bug, the return value of the underlying consumer's consume_end_of_partition() is now respected. A unit test is also added which reproduces the problem both with select distinct ... and select ... per partition limit. Follow-up work: * frozen_mutation_consumer_adaptor::on_end_of_partition() calls the underlying consumer's on_end_of_stream(), so when consuming multiple frozen mutations, the underlying's on_end_of_stream() is called for each partition. This is incorrect but benign. * Improve documentation of mutation_reader::consume_pausable(). Fixes: #20084 Closes scylladb/scylladb#23657	2025-04-10 13:19:57 +03:00
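The skip-and-resume protocol described above can be modelled in a few lines. This is a simplified Python model, not the C++ adaptor: `True` stands in for stop_iteration::yes, the class and function names are hypothetical, and the `buggy` flag only reproduces the symptom (a per-partition skip ends the whole stream), not the actual faulty code.

```python
class PerPartitionLimitConsumer:
    """Keeps at most `limit` rows per partition, then asks to skip the rest."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.result = []

    def consume_new_partition(self, key):
        self.key = key
        self.count = 0

    def consume(self, row):
        self.result.append((self.key, row))
        self.count += 1
        return self.count >= self.limit  # True: skip the rest of this partition

    def consume_end_of_partition(self):
        return False  # False: continue with the next partition


def consume_pausable(partitions, consumer, buggy=False):
    for key, rows in partitions:
        consumer.consume_new_partition(key)
        skipped = False
        for row in rows:
            if consumer.consume(row):
                skipped = True
                break
        stop = consumer.consume_end_of_partition()
        # Correct behaviour: only the end-of-partition answer ends the stream.
        # Buggy model: a per-partition skip is mistaken for a full stop.
        if stop or (buggy and skipped):
            break
    return consumer.result


data = [("a", [1, 2, 3]), ("b", [4, 5])]
print(consume_pausable(data, PerPartitionLimitConsumer(1)))
print(consume_pausable(data, PerPartitionLimitConsumer(1), buggy=True))
```

In the buggy run, partition "b" is missing from the result, which matches the omitted partitions described for select distinct and per partition limit queries.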
Dawid Mędrek	0ed21d9cc1	test/cluster/test_tablets.py: Fix erroneous test indentation Some of the statements in the test are not indented properly and, as a result, are never run. It's most likely a small mistake, so let's fix it. Closes scylladb/scylladb#23659	2025-04-10 11:06:01 +03:00
Nadav Har'El	258213f73b	Merge 'Alternator batch count histograms' from Amnon Heiman This series adds a histogram for get and write batch sizes. It uses the estimated_histogram implementation which starts from 1 with 1.2 exponential factor, which works extremely tight to 20 but still covers all the way to 100. Histograms will be reported per node. Backport to 2025.1 so we'll have information about user batch size limitation Closes scylladb/scylladb#23379 * github.com:scylladb/scylladb: alternator: Add tests for the batch items histograms alternator: Add histogram for batch item count	2025-04-09 22:41:14 +03:00
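To see why a histogram that starts from 1 with a 1.2 exponential factor is fine-grained for small batch sizes yet reaches 100 with few buckets, here is a minimal sketch. The growth rule (next bound is the larger of `last + 1` and `round(last * 1.2)`) is an assumption borrowed from the classic estimated-histogram scheme; the commit message only states the starting point and the factor, and the function names are hypothetical.

```python
from bisect import bisect_left


def bucket_bounds(limit, factor=1.2):
    # Assumed rule: grow by `factor`, but always advance by at least 1.
    bounds = [1]
    while bounds[-1] < limit:
        last = bounds[-1]
        bounds.append(max(last + 1, round(last * factor)))
    return bounds


def bucket_for(value, bounds):
    # Index of the first bucket whose upper bound is >= value.
    return bisect_left(bounds, value)


bounds = bucket_bounds(100)
print(bounds)
print(len(bounds))
print(bounds[bucket_for(9, bounds)])
```

Under this assumed rule the smallest sizes each get their own bucket, and bucket width grows only gradually as sizes approach 100.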
Tomasz Grabiec	b5211cca85	Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk Currently, when we rebuild a tablet, we stream data from all replicas. This creates a lot of redundancy, wastes bandwidth and CPU resources. In this series, we split the streaming stage of tablet rebuild into two phases: first we stream tablet's data from only one replica and then repair the tablet. Fixes: https://github.com/scylladb/scylladb/issues/17174. Needs backport to 2025.1 to prevent out of space during streaming Closes scylladb/scylladb#23187 * github.com:scylladb/scylladb: test: add test for rebuild with repair locator: service: move to rebuild_v2 transition if cluster is upgraded locator: service: add transition to rebuild_repair stage for rebuild_v2 locator: service: add rebuild_repair tablet transition stage locator: add maybe_get_primary_replica locator: service: add rebuild_v2 tablet transition kind gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-09 21:35:37 +02:00
Avi Kivity	ed3e4f33fd	Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency. It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed. New connections_shed and connections_blocked metrics are added for tracking. Testing: - manually via simple program creating high number of connection and constantly re-connecting - added benchmark Following are benchmark results: Before: ``` > build/release/test/perf/perf_generic_server --smp=1 170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors) [...] throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54 instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96 cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15 ``` After: ``` > build/release/test/perf/perf_generic_server --smp=1 167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors) [...] throughput: mean= 171199.55 standard-deviation=2484.58 median= 171667.06 median-absolute-deviation=2087.63 maximum=173689.11 minimum=167904.76 instructions_per_op: mean= 4801.90 standard-deviation=16.54 median= 4796.78 median-absolute-deviation=9.32 maximum=4830.71 minimum=4789.81 cpu_cycles_per_op: mean= 3245.26 standard-deviation=32.28 median= 3230.44 median-absolute-deviation=16.52 maximum=3297.39 minimum=3215.62 ``` The patch adds around 67 insns/op so its effect on performance should be negligible. 
Fixes: https://github.com/scylladb/scylladb/issues/22844 Closes scylladb/scylladb#22828 * github.com:scylladb/scylladb: transport: move on_connection_close into connection destructor test: perf: make aggregated_perf_results formatting more human readable transport: add blocked and shed connection metrics generic_server: throttle and shed incoming connections according to semaphore limit generic_server: add data source and sink wrappers bookkeeping network IO generic_server: coroutinize part of server::do_accepts test: add benchmark for generic_server test: perf: add option to count multiple ops per time_parallel iteration generic_server: add semaphore for limiting new connections concurrency generic_server: add config to the constructor generic_server: add on_connection_ready handler	2025-04-09 21:41:38 +03:00
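The throttle-and-shed idea from the generic_server merge can be sketched with an asyncio semaphore. This is only an analogy in Python, not the Seastar implementation: the function name, the parameters, and the counter dictionary are hypothetical, and the short timings stand in for the one-minute shed threshold mentioned in the description.

```python
import asyncio


async def simulate(n_connections, concurrency, init_time, shed_after):
    # The semaphore bounds how many new connections are initialized at once.
    sem = asyncio.Semaphore(concurrency)
    stats = {"processed": 0, "blocked": 0, "shed": 0}

    async def handle_new_connection():
        if sem.locked():
            # No free unit: the connection has to wait (counted as blocked).
            stats["blocked"] += 1
            try:
                await asyncio.wait_for(sem.acquire(), timeout=shed_after)
            except asyncio.TimeoutError:
                # Waited too long for initial processing: shed the connection.
                stats["shed"] += 1
                return
        else:
            await sem.acquire()
        try:
            await asyncio.sleep(init_time)  # stands in for connection setup work
            stats["processed"] += 1
        finally:
            sem.release()

    await asyncio.gather(*(handle_new_connection() for _ in range(n_connections)))
    return stats


print(asyncio.run(simulate(n_connections=3, concurrency=1, init_time=0.2, shed_after=0.02)))
```

With one unit of concurrency and a shed timeout shorter than the setup time, the first connection is processed while the other two are counted as blocked and then shed.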
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	668094dc58	virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() So that population can access read's timeout and mark the permit as awaiting.	2025-04-09 20:21:51 +02:00
Marcin Maliszkiewicz	619944555f	test: perf: make aggregated_perf_results formatting more human readable Before: throughput: mean=170728.58 standard-deviation=1921.76 median=171084.16 median-absolute-deviation=1501.58 maximum=172913.36 minimum=167288.97 instructions_per_op: mean=4685.89 standard-deviation=12.46 median=4683.92 median-absolute-deviation=9.68 maximum=4706.53 minimum=4666.70 cpu_cycles_per_op: mean=3090.94 standard-deviation=52.69 median=3103.43 median-absolute-deviation=24.55 maximum=3192.99 minimum=3003.00 After: throughput: mean= 168224.81 standard-deviation=854.48 median= 168829.02 median-absolute-deviation=604.21 maximum=168829.02 minimum=167620.60 instructions_per_op: mean= 4837.02 standard-deviation=20.89 median= 4851.79 median-absolute-deviation=14.77 maximum=4851.79 minimum=4822.24 cpu_cycles_per_op: mean= 3271.42 standard-deviation=46.29 median= 3304.16 median-absolute-deviation=32.73 maximum=3304.16 minimum=3238.69	2025-04-09 10:49:20 +02:00
Marcin Maliszkiewicz	719d04d501	test: add benchmark for generic_server Changes in configure.py are needed because we don't want to embed this benchmark in scylla binary as perf_simple_query or perf_alternator, it doesn't directly translate to Scylla performance but we want to use aggregated_perf_results for precise cpu measurements so we need different dependencies.	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	b957cedace	test: perf: add option to count multiple ops per time_parallel iteration	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	ed82bede39	generic_server: add semaphore for limiting new connections concurrency It will be used in following commits.	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	33122d3f93	generic_server: add config to the constructor	2025-04-09 10:30:58 +02:00
Botond Dénes	b65a76ab6f	Merge 'nodetool: cluster repair: add a command to repair tablet keyspaces' from Aleksandra Martyniuk Add a new nodetool cluster super-command. Add nodetool cluster repair command to repair tablet keyspaces. It uses the new /storage_service/tablets/repair API. The nodetool cluster repair command allows you to specify the keyspace and tables to be repaired. A cluster repair of many tables will request /storage_service/tablets/repair and wait for the result synchronously for each table. The nodetool repair command, which was previously used to repair keyspaces of any type, now repairs only vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/22409. Needs backport to 2025.1 that introduces the new tablet repair API Closes scylladb/scylladb#22905 * github.com:scylladb/scylladb: docs: nodetool: update repair and add tablet-repair docs test: nodetool: add tests for cluster repair command nodetool: add cluster repair command nodetool: repair: extract getting hosts and dcs to functions nodetool: repair: warn about repairing tablet keyspaces nodetool: repair: move keyspace_uses_tablets function	2025-04-09 08:20:34 +03:00
Botond Dénes	5f697d373f	test/cqlpy/test_tools.py: use AIO backend in scylla-sstable query tests These tests seem to be hitting the io-uring bug in the kernel from time-to-time, making CI flaky. Force the use of the AIO backend in these tests, as a workaround until fixed kernels (>=6.8.13) are available. Fixes: #23517 Fixes: #23546 Closes scylladb/scylladb#23648	2025-04-08 20:29:58 +03:00
Tomasz Grabiec	06b49bdf69	Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. Closes scylladb/scylladb#23255 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-08 17:26:58 +02:00
Robert Bindar	4e3eb2fdac	Move direct_failure_detector from root to service/ direct_failure_detector used to be used by gms/ as well, but that's not the case anymore, so raft/ is the only user. Fixes #23133 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23248	2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk	372b562f5e	test: add test for rebuild with repair	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	02fb71da42	test: nodetool: add tests for cluster repair command	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	b81c81c7f4	nodetool: repair: warn about repairing tablet keyspaces Warn about an attempt to repair tablet keyspace with nodetool repair. A nodetool cluster repair command to repair tablet keyspaces will be added in the following patches.	2025-04-08 09:13:14 +02:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). 
So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
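The assert discussed in the truncate fix can be replayed with plain numbers. The two predicates below transcribe the condition quoted in the commit message and the stated goal (low_mark >= RP); the function names and the concrete values are made up for illustration.

```python
def old_assert_holds(truncated_at, low_mark_at, rp, low_mark):
    # Transcription of: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
    if truncated_at <= low_mark_at:
        return rp <= low_mark
    return low_mark <= rp


def proposed_check_holds(rp, low_mark):
    # Goal stated in the commit message: low_mark >= RP.
    return rp <= low_mark


# The table-wide highest RP (low_mark) was produced by a tablet that has since
# migrated away, so the sstables left on this shard carry a lower RP.
low_mark, rp = 100, 40
# Steps 2 and 3 take time, so truncated_at ends up later than low_mark_at.
low_mark_at, truncated_at = 1, 2

print(old_assert_holds(truncated_at, low_mark_at, rp, low_mark))
print(proposed_check_holds(rp, low_mark))
```

The old condition demands low_mark <= rp in this scenario and therefore fails, while the simpler check passes.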
Botond Dénes	0d39091df2	test/boost/row_cache_test: add memtable overlap check tests Similar to test/cluster/test_data_resurrection_in_memtable.py but works on a single node and uses more low-level mechanism. These tests can also reproduce more advanced scenarios, like concurrent reads, with some reading from flushed memtables.	2025-04-08 00:11:36 -04:00
Botond Dénes	34b18d7ef4	test/cluster: add test_data_resurrection_in_memtable.py Reproducers for #23252 and #23291 -- cache garbage collecting tombstones resurrecting data in the memtable.	2025-04-08 00:11:36 -04:00
Botond Dénes	e5afd9b5fb	test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts Such that a given index in the return hosts refers to the same underlying Scylla instance, as the same index in the passed-in nodes list. This is what users of this method intuitively expect, but currently the returned hosts list is unordered (has random order).	2025-04-08 00:11:36 -04:00
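The ordering guarantee described for wait_for_cql_and_get_hosts() amounts to sorting the returned hosts by the position of their node in the input list. A minimal sketch, assuming hosts can be matched to nodes by address; the `Host` type and `sort_hosts_like_nodes` are hypothetical stand-ins, not the test library's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    address: str  # hypothetical: a driver host identified by its IP address


def sort_hosts_like_nodes(node_addresses, hosts):
    # Position of each node in the caller's list decides the output order.
    position = {addr: i for i, addr in enumerate(node_addresses)}
    return sorted(hosts, key=lambda host: position[host.address])


nodes = ["127.0.0.3", "127.0.0.1", "127.0.0.2"]
unordered = [Host("127.0.0.1"), Host("127.0.0.2"), Host("127.0.0.3")]
ordered = sort_hosts_like_nodes(nodes, unordered)
# ordered[i] now refers to the same instance as nodes[i].
print([host.address for host in ordered])
```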


8643 Commits