scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-02 04:56:58 +00:00

Author	SHA1	Message	Date
Aleksandra Martyniuk	85d9565158	test: repair: drop log checks from test_repair_succeeds_with_unitialized_bm Currently, test_repair_succeeds_with_unitialized_bm checks whether repair finishes successfully and the error is properly handled if batchlog_manager isn't initialized. Error handling depends on logs, making the test fragile to external conditions and flaky. Drop the error handling check, successful repair is a sufficient passing condition. Fixes: #21167. Closes scylladb/scylladb#21208	2024-10-28 08:39:16 +02:00
Botond Dénes	be70755f47	Merge 'repair: Fix finished ranges metrics for removenode' from Asias He The skipped ranges should be multiplied by the number of tables Otherwise the finished ranges ratio will not reach 100%. Fixes #21174 Closes scylladb/scylladb#21252 * github.com:scylladb/scylladb: test: Add test_node_ops_metrics.py repair: Make the ranges more consistent in the log repair: Fix finished ranges metrics for removenode	2024-10-28 08:09:32 +02:00
Asias He	9868ccbac0	test: Add test_node_ops_metrics.py It tests the node_ops_metrics_done metric reaches 100% when a node ops is done. Refs: #21174	2024-10-28 08:45:37 +08:00
Kamil Braun	f5c60e538d	Merge 'cql/tablets: fix retrying ALTER tablets KEYSPACE' from Piotr Smaron ALTER tablets-enabled KEYSPACES (KS) may fail due to `group0_concurrent_modification`, in which case it's repeated by a `for` loop surrounding the code. But because raft's `add_entry` consumes the raft's guard (by `std::move`'ing the guard object), retries of ALTER KS will use a moved-from guard object, which is UB, potentially a crash. The fix is to remove the aforementioned `for` loop altogether and rethrow the exception, as the `rf_change` event will be repeated by the topology state machine if it receives the concurrent modification exception, because the event will remain present in the global requests queue, hence it's going to be executed as the very next event. Note: refactor is implemented in the follow-up commit. Fixes: scylladb/scylladb#21102 Should be backported to every 6.x branch, as it may lead to a crash. Closes scylladb/scylladb#21121 * github.com:scylladb/scylladb: test: add UT to test retrying ALTER tablets KEYSPACE cql/tablets: fix indentation in `rf_change` event handler cql/tablets: fix retrying ALTER tablets KEYSPACE	2024-10-23 10:01:21 +02:00
Botond Dénes	519e167611	Merge 'replica/table: check memtable before discarding tombstone during read' from Lakshmi Narayanan Sreethar On the read path, the compacting reader is applied only to the sstable reader. This can cause an expired tombstone from an sstable to be purged from the request before it has a chance to merge with deleted data in the memtable leading to data resurrection. Fix this by checking the memtables before deciding to purge tombstones from the request on the read path. A tombstone will not be purged if a key exists in any of the table's memtables with a minimum live timestamp that is lower than the maximum purgeable timestamp. Fixes #20916 `perf-simple-query` stats before and after this fix : `build/Dev/scylla perf-simple-query --smp=1 --flush` : ``` // Before this Fix // --------------- 94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors) 97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors) 96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors) 97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors) 97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors) throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79 instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02 cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19 // After this Fix // -------------- 95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors) 97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 
24005 cycles/op, 0 errors) 98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors) 96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors) 93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors) throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21 instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73 cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22 ``` This PR fixes a regression introduced in `ce96b472d3` and should be backported to older versions. Closes scylladb/scylladb#20985 * github.com:scylladb/scylladb: topology-custom: add test to verify tombstone gc in read path replica/table: check memtable before discarding tombstone during read compaction_group: track maximum timestamp across all sstables	2024-10-23 10:28:00 +03:00
Nadav Har'El	5fd3177057	Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates, which is currently unlimited. Because of that, view updates may take up so much memory that we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. 
The new semaphore has half the capacity of the regular user read concurrency semaphore and is currently used only for user writes - it's used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view update workload. The issue of view updates causing increased latencies most often occurs in the following scenario: * we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows * we have any read workload * one replica is slower or is handling more writes due to an imbalance of data distribution * we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it. * each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued * the queue is shared by regular reads and view update reads. When there are enough view update reads in the queue, regular reads start getting increased latencies An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario: * the tables were prepared satisfying the conditions above * we use a medium write workload and a very low read workload * the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. 
We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update * to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond. In the test above: * without the fix, the latency of reads increases over 50s * with the fix, the latency of reads stays below 20ms Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805 The patch is not that small and it isn't fixing a regression, so no backports Closes scylladb/scylladb#20887 * github.com:scylladb/scylladb: test: add test for high view update concurrency causing bad_allocs test: add test for high view update concurrency degrading read latency mv: add a dedicated read concurrency semaphore for view update read before writes	2024-10-22 22:17:23 +03:00
Piotr Smaron	522bede8ec	test: add UT to test retrying ALTER tablets KEYSPACE The newly added testcase is based on the already existing `test_alter_dropped_tablets_keyspace`. A new error injection is created, which stops the ALTER execution just before the changes are submitted to RAFT. In the meantime, a new schema change is performed using the 2nd node in the cluster, thus causing the 1st node to retry the ALTER statement.	2024-10-22 18:22:01 +02:00
Wojciech Mitros	4d719bacca	test: add test for high view update concurrency causing bad_allocs This commit adds a test for checking whether a large view update workload can cause Scylla to run out of memory. In the test, we keep writing to a table with a materialized view with a limited number of rows, causing overwrites which require reading from the table to perform view updates. Currently, due to the unlimited concurrency of view update reads, we may use too much memory which can lead to bad_allocs, causing Scylla to fail. To reach the failing state more consistently, we add a sleep after reading the old value of the base row, to keep the reader concurrency semaphore units longer. At the same time, we use high concurrency and large row size to use up all Scylla's memory quickly. The test fails if Scylla runs out of memory and aborts, and succeeds otherwise.	2024-10-21 12:35:20 +02:00
Wojciech Mitros	f2c740710c	test: add test for high view update concurrency degrading read latency This commit adds a test for checking whether a large view update workload impacts the latency of other user reads. In the test, we first create a table for reads and another table with a materialized view. We then start writing to the table with the view with a limited number of rows - when overwriting, we need to read the previous value of the row to prepare a delete of the old row in the view. This should not impact the latency of the read workload from the other table that we start at the same time. The test fails if any of the reads times out. To reach the failing state more consistently, we add a sleep after reading the old value of the base row, to keep the reader concurrency semaphore units longer. At the same time, we use a lower threshold for queueing reads on the semaphore, to see the impact of view update reads earlier. Because of the high load, the writes may time out, but that's expected - we fail the test only if the user reads time out.	2024-10-21 12:34:55 +02:00
Lakshmi Narayanan Sreethar	afad1b3c85	topology-custom: add test to verify tombstone gc in read path Co-authored-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-10-18 19:20:03 +05:30
Piotr Dulikowski	a380a2efd9	test/test_view_build_status: properly wait for v2 in migration test The test_view_build_status_migration_to_v2 test case creates a new view (vt2) after performing the view_build_status -> view_build_status_v2 migration and waits until it is built by the `wait_for_view_v2` function. It works by waiting until a SELECT from view_build_status_v2 returns the expected number of rows for a given view. However, if the host parameter is unspecified, it will query only one node on each attempt. Because `view_build_status_v2` is managed via raft, queries always return data from the queried node only. It might happen that `wait_for_view_v2` fetches expected results from one node while a different node might be lagging behind the group0 coordinator and might not have all data yet. In case of test_view_build_status_migration_to_v2 this is a problem - it first uses `wait_for_view_v2` to wait for view, later it queries `view_build_status_v2` on a random node and asserts its state - and might fail because that node didn't have the newest state yet. Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in the cluster and waiting until all nodes have the most recent state. Fixes: scylladb/scylladb#21060 Closes scylladb/scylladb#21091	2024-10-15 14:57:47 +03:00
Piotr Smaron	3969ffb39f	test: fix flaky `test_multidc_alter_tablets_rf` The testcase is flaky due to a known python driver issue: https://github.com/scylladb/python-driver/issues/317. This issue causes the `CREATE KEYSPACE` statement to be sometimes executed twice in a row, and the 2nd CREATE statement causes the test to fail. In order to work around it, it's enough to add `if not exists` when creating a ks. Fixes: scylladb/scylladb#21034 Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch. Closes scylladb/scylladb#21056	2024-10-14 16:18:44 +02:00
Sergey Zolotukhin	132358dc92	tests: Add tests for alter table with RF=1 to RF=0 Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor from 1 to 0 for one of two data centers. Tablet version fails due to issue described in scylladb/scylladb#20625. Test for scylladb/scylladb#20625	2024-10-11 09:38:24 +02:00
Lakshmi Narayanan Sreethar	69c385f540	compaction: make drain wait for compactions to stop during shutdown During shutdown, the compaction_manager starts stopping ongoing compaction tasks through `really_do_stop()` method as soon as it receives a signal from the abort source. Later, when the database object shuts down, it calls `compaction_manager::drain` to ensure that all compaction tasks have stopped. However, `compaction_manager::drain` is currently implemented in such a way that, during shutdown, it effectively becomes a no-op because the compaction_manager has already initiated the stopping of tasks. As a result the caller assumes that all the compaction tasks have stopped and proceeds to close all the tables. This can lead to race conditions where table closures overlap with compaction tasks that are still running, resulting in exceptions like : ``` exception during mutation write to 127.0.0.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010b70d31705e0411efb2edf6467f094c8b}) to commitlog): seastar::gate_closed_exception (gate closed) ``` This commit fixes the issue by updating `compaction_manager::drain` to invoke `stop_ongoing_compactions` even during shutdown to ensure that it waits for the ongoing compaction tasks to complete. The `stop_ongoing_compactions` method will also send a stop request to these tasks before waiting, but the request will be ignored by the tasks as they would have already received one earlier from `really_do_stop()`. Fixes #20197 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#20715	2024-10-09 12:08:32 +03:00
Piotr Smaron	e0c1a51642	cql/tablets: handle MVs in ALTER tablets KEYSPACE ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized views (MV), and only produced tablets mutations changing tables. With this patch we're producing tablets mutations for both tables and MVs, hence when e.g. we change the replication factor (RF) of a KS, both the tables' RFs and MVs' RFs are updated along with tablets replicas. The `test_tablet_rf_change` testcase has been extended to also verify that MVs' tablets replicas are updated when RF changes. Fixes: #20240 Closes scylladb/scylladb#21007	2024-10-09 10:51:18 +02:00
Piotr Smaron	ee56bbfe61	cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS Tablets load balancer is unable to process more than a single pending replica, thus ALTER tablets KS cannot accept an ALTER statement which would result in creating 2+ pending replicas, hence it has to validate that the sum of absolute differences of RFs specified in the statement is not greater than 1.	2024-10-07 17:02:50 +02:00
Piotr Smaron	2aabe7f09c	cql: join new and old KS options in ALTER tablets KS A bug has been discovered while trying to ALTER tablets KS and specifying only 1 out of 2 DCs - the not specified DC's RF has been zeroed. This is because ALTER tablets KS updated the KS only with the RF-per-DC mapping specified in the ALTER tablets KS statement, so if a DC was omitted, it was assigned a value of RF=0. This commit fixes that plus additionally passes all the KS options, not only the replication options, to the topology coordinator, where the KS update is performed. `initial_tablets` is a special case, which requires special handling in the source code, as we cannot simply update old initial_tablet's settings with the new ones, because if only ` and TABLETS = {'enabled': true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but rather keep the old value - this is tested by the `test_alter_preserves_tablets_if_initial_tablets_skipped` testcase. Other than that, the above mentioned testcase started to fail with these changes, and it appeared to be an issue with the test not waiting until ALTER is completed, and thus reading the old value, hence the test's body has been modified to wait for ALTER to complete before performing validation.	2024-10-07 17:02:45 +02:00
Piotr Smaron	47acdc1f98	cql: validate RF change for new DCs in ALTER tablets KS ALTER tablets KS validated if RF is not changed by more than 1 for DCs that already had replicas, but not for DCs that didn't have them yet, so specifying an RF jump from 0 to 2 was possible when listing a new DC in ALTER tablets KS statement, which violated internal invariants of tablets load balancer. This PR fixes that bug and adds multi-DC testcases to check that adding replicas to a new DC and removing replicas from a DC honors the RF change constraints. Refs: #20039	2024-10-07 16:02:01 +02:00
Pavel Emelyanov	1dfe780457	cql: Check that CREATEing tablets/vnodes is consistent with the CLI There are two bits that control whether the replication strategy for a keyspace will use tablets or not -- the configuration option and CQL parameter. This patch tunes its parsing to implement the logic shown below: if (strategy.supports_tablets) { if (cql.with_tablets == on) { if (cfg.enable_tablets) { return create_keyspace_with_tablets(); } else { throw "tablets are not enabled"; } } else if (cql.with_tablets == off) { return create_keyspace_without_tablets(); } else { // cql.with_tablets is not specified if (cfg.enable_tablets) { return create_keyspace_with_tablets(); } else { return create_keyspace_without_tablets(); } } } else { // strategy doesn't support tablets if (cql.with_tablets == on) { throw "invalid cql parameter"; } else if (cql.with_tablets == off) { return create_keyspace_without_tablets(); } else { // cql.with_tablets is not specified return create_keyspace_without_tablets(); } } closes: #20088 In order to enable tablets "by default" for NetworkTopologyStrategy there's explicit check near ks_prop_defs::get_initial_tablets(), that's not very nice. It needs more care to fix it, e.g. provide feature service reference to abstract_replication_strategy constructor. But since ks_prop_defs code already hijacks options specifically for that strategy type (see prepare_options() helper), it's OK for now. There's also #20768 misbehavior that's preserved in this patch, but should be fixed eventually as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#20779	2024-10-01 10:54:29 +02:00
Benny Halevy	23d6b996b8	test/pylib: scylla_cluster: set endpoint_snitch in scylla conf When `property_file` is provided, we generate a `cassandra-rackdc.properties` file, but to actually use it, `endpoint_snitch` must be set to `GossipingPropertyFileSnitch`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#20730	2024-09-27 16:46:54 +03:00
Kamil Braun	9224e48d6b	Merge 'Populate raft address map from gossiper on raft configuration change' from Gleb Natapov For each new node added to the raft config populate its ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper. Fixes scylladb/scylladb#20600 Backport to all supported versions since the bug may cause bootstrapping failure. Closes scylladb/scylladb#20601 * github.com:scylladb/scylladb: test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join group0: make sure that address map has an entry for each new node in the raft configuration	2024-09-26 12:41:25 +02:00
Gleb Natapov	9e4cd32096	test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join	2024-09-25 17:10:09 +03:00
Kamil Braun	7d8f1d251a	Merge 'Mark node as being replaced earlier' from Gleb Natapov Before `17f4a151ce` the node was marked as being replaced in join_group0 state, before it actually joins the group0, so by the time it actually joins and starts transferring snapshot/log no traffic is sent to it. The commit changed this to mark the node as being replaced after the snapshot/log is already transferred, so we can get the traffic to the node while it still has not caught up with a leader, and this may cause problems since the state is not complete. Mark the node as being replaced earlier, but still add the new node to the topology later as the commit above intended. Fixes: scylladb/scylladb#20629 Needs to be backported since this is a regression Closes scylladb/scylladb#20743 * github.com:scylladb/scylladb: test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts topology coordinator:: mark node as being replaced earlier topology coordinator: do metadata barrier before calling finish_accepting_node() during replace	2024-09-25 15:46:12 +02:00
Kamil Braun	69b4769418	test: fix `topology_custom/test_raft_recovery_stuck` flakiness The test performs consecutive schema changes in RECOVERY mode. The second change relies on the first. However the driver might route the changes to different servers and we don't have group 0 to guarantee linearizability. We must rely on the first change coordinator to push the schema mutations to other servers before returning, but that only happens when it sees other servers as alive when doing the schema change. It wasn't guaranteed in the test. Fix this. Fixes scylladb/scylladb#20791 Should be backported to all branches containing this test to reduce flakiness. Closes scylladb/scylladb#20792	2024-09-25 08:45:37 +03:00
Gleb Natapov	1213f02a5a	test: skip test_lwt_semaphore::test_cas_semaphore in aarch64 debug mode The test configures the write timeout to a much smaller value to make the test run faster, since for some writes a sleep is inserted to hit the timeout, but it makes aarch64 debug flaky since the timeout happens when it should not because of natural slowness. Fixes scylladb/scylladb#20515 Closes scylladb/scylladb#20744	2024-09-23 20:46:55 +02:00
Avi Kivity	6f7c2ce0aa	Merge 'cql_server::connection: Process rebounce message in case of multiple shard migrations' from Sergey Zolotukhin During a query execution, the query can be re-bounced to another shard if the requested data is located there. Previous implementation assumed that the shard cannot be changed after first re-bounce, however with the introduction of Tablets, data could be migrated to another shard after the query was already re-bounced, causing a failure of the query execution. To avoid this issue, the query is re-bounced as needed until it is executed on the correct shard. Fixes #15465 Closes scylladb/scylladb#20493 * github.com:scylladb/scylladb: cql_server: Add a test for multiple query msg rebounces. cql_server::connection: process: rebounce msg if needed cql_server::connection: process: co-routinize connection::process_on_shard cql_server: connection: process: fixup indentation cql_server: connection: process_on_shard: drop permit parameter transport: server: pass bounce_to_shard as foreign shared ptr cql_server: connection: process: add template concept for process_fn cql_server: move process_fn_return_type to class definition	2024-09-19 17:27:55 +03:00
Gleb Natapov	1b4c255ffd	test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts	2024-09-19 15:24:59 +03:00
Sergey Zolotukhin	68740f57c2	cql_server: Add a test for multiple query msg rebounces. The test emulates several LWT (Lightweight Transaction) query rebounces. Currently, the code that processes queries does not expect that a query may be rebounced more than once. It was impossible with VNodes, but with the introduction of Tablets, data can be moved between shards by the balancer, thus a query can be rebounced to different shards multiple times.	2024-09-17 15:19:56 +02:00
Alexey Novikov	8b6e987a99	test: add test_pinned_cl_segment_doesnt_resurrect_data Add a test for the issue where writes in commitlog segments pinned to another table can be resurrected. This test is based on dtest code published in #14870 and adapted for the community version. It's a regression test for the #15060 fix and should fail before this patch and succeed afterwards. Refs #14870, #15060 Closes scylladb/scylladb#20331	2024-09-12 10:58:22 +03:00
Piotr Dulikowski	d98708013c	Merge 'view: move view_build_status to group0' from Michael Litvak Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations. The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table. New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table. The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility. When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows. 
Fixes scylladb/scylladb#15329 dtest https://github.com/scylladb/scylla-dtest/pull/4827 Closes scylladb/scylladb#19745 * github.com:scylladb/scylladb: view: test view_build_status table with node replace test/pylib: use view_build_status_v2 table in wait_for_view view_builder: common write view_build_status function view_builder: improve migration to v2 with intermediate phase view: delete node rows from view_build_status on node removal view: sanitize view_build_status during migration view: make old view_build_status table a virtual table replica: move streaming_reader_lifecycle_policy to header file view_builder: test view_build_status_v2 storage_service: add view_build_status to raft snapshot view_builder: migration to v2 db:system_keyspace: add view_builder_version to scylla_local view_builder: read view status from v2 table view_builder: introduce writing status mutations via raft view_builder: pass group0_client and qp to view_builder view_builder: extract sys_dist status operations to functions db:system_keyspace: add view_build_status_v2 table	2024-09-11 13:02:58 +02:00
Lakshmi Narayanan Sreethar	a0f4fe3fc4	cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-09-10 16:39:05 +05:30
Avi Kivity	9448260b30	Merge 'major compaction: check only sstables being compacted for tombstone garbage collection' from Lakshmi Narayanan Sreethar Any expired tombstone can be garbage collected if it doesn't shadow data in the commit log, memtable, or uncompacting SSTables. This PR introduces a new mode to major compaction, enabled by the `consider_only_existing_data` flag that bypasses these checks. When enabled, memtables and old commitlog segments are cleared with a system-wide flush and all the sstables (after flush) are included in the compaction, so that it works with all data generated up to a given time point. This new mode works with the assumption that newly written data will not be shadowed by expired tombstones. So it ignores new sstables (and new data written to memtable) created after compaction started. Since there was a system wide flush, commitlog checks can also be skipped when garbage collecting tombstones. Introducing data shadowed by a tombstone during compaction can lead to undefined behavior, even without this PR, as the tombstone may or may not have already been garbage collected. Fixes #19728 Closes scylladb/scylladb#20031 * github.com:scylladb/scylladb: cql-pytest: add test to verify consider_only_existing_data compaction option tools/scylla-nodetool: add consider-only-existing-data option to compact command api: compaction: add `consider_only_existing_data` option compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled tombstone_gc_state: introduce with_commitlog_check_disabled() compaction: introduce new option to check only compacting sstables for gc compaction: rename maybe_flush_all_tables to maybe_flush_commitlog compaction: maybe_flush_all_tables: add new force_flush param	2024-09-09 20:45:41 +03:00
Piotr Dulikowski	6f3d0af994	test: topology_custom/test_hints: consistency test for decommission Adds the test_hints_consistency_during_decommission test which reproduces the failure observed in scylladb/scylla-dtest#4582. It uses error injections, including the newly added topology_coordinator_pause_after_streaming injection, to reliably orchestrate the scenario observed there. In a nutshell, the test makes sure to replay hints after streaming during decommission has finished, but before the cluster switches to reading from new replicas. Without the fix, hints would be replayed to the decommissioned node and then would be lost forever after the cluster starts reading from new replicas.	2024-09-08 10:51:38 +02:00
Piotr Dulikowski	30d53167c9	test: topology_custom/test_hints: move sync point helpers to top level Move create_sync_point and await_sync_point from the scope of the test_sync_point test to the file scope. They will be used in a test that will be introduced in the commit that follows.	2024-09-08 10:51:38 +02:00
Kamil Braun	52fdf5b4c9	test: test_raft_no_quorum: increase raft timeout in debug mode The test cases in this file use an error injection to reduce raft group 0 timeouts (from the default 1 minute), in order to speed up the tests; the scenarios expect these timeouts to happen, so we want them to happen as quickly as possible, but we don't want to reduce timeouts so much that it will make other operations fail when we don't expect them to (e.g. when the test wants to add a node to the cluster). Unfortunately the selected 5 seconds in debug mode was not enough and made the tests flaky: scylladb/scylladb#20111. Increase it to 10 seconds. This unfortunately will slow down these tests as they sometimes have to wait 10 seconds for the timeout to happen. But better to have this than a flaky test. Fixes: scylladb/scylladb#20111 Closes scylladb/scylladb#20320	2024-09-06 11:40:09 +03:00
Michael Litvak	9545e0a114	view: test view_build_status table with node replace Add a test replacing a node and verifying the contents of the view_build_status table are updated as expected, having rows for the new node and no rows for the old node.	2024-09-05 15:42:35 +03:00
Michael Litvak	3ca5dd537f	test/pylib: use view_build_status_v2 table in wait_for_view Change the util function wait_for_view to read the view build status from the system.view_build_status_v2 table which replaces system_distributed.view_build_status. The old table can still be used, but it is less efficient because it's implemented as a virtual table which reads from the v2 table, and this can cause slowness in tests, so it's better to read directly from the v2 table. The additional util function wait_for_view_v1 reads from the old table. This may be needed in upgrade tests if the v2 table is not available yet.	2024-09-05 15:42:35 +03:00
Michael Litvak	c1f3517a75	view_builder: improve migration to v2 with intermediate phase Add an intermediate phase to the view builder migration to v2 where we write to both the old and new table in order to not lose writes during the migration. We add an additional view builder version v1_5 between v1 and v2 where we write to both tables. We perform a barrier before moving to v2 to ensure all the operations to the old table are completed.	2024-09-05 15:42:35 +03:00
Michael Litvak	446ad3c184	view: delete node rows from view_build_status on node removal When a node is removed we want to clean its rows from the view_build_status table. Now when removing a node and generating the topology state update, we also generate the mutations to delete all the possible rows belonging to the node from the table.	2024-09-05 15:42:35 +03:00
Michael Litvak	08462aaff7	view: sanitize view_build_status during migration When migrating the view_build_status to v2, skip adding any leftover rows that don't correspond to an existing node or an existing view. Previously such rows could have been created and not cleaned, for example when a node is removed.	2024-09-05 15:42:35 +03:00
Michael Litvak	78d6ff6598	view: make old view_build_status table a virtual table After migrating the view build status from system_distributed.view_build_status to system.view_build_status_v2, we set system_distributed.view_build_status to be a virtual table, such that reading from it is actually reading from the underlying new table. The reason for this is that we want to keep compatibility with the old table, since it exists also in Cassandra and it is used by various external tools to check the view build status. Making the table virtual makes the transition transparent for external users. The two tables are in different keyspaces and have different shard mapping. The v1 table is a distributed table with a normal shard mapping, and the v2 table is a local table using the null sharder. The virtual reader works by constructing a multishard reader which reads the rows from shard zero, and then filtering it to get only the rows owned by the current shard.	2024-09-05 15:42:35 +03:00
Michael Litvak	22f4f1fa49	view_builder: test view_build_status_v2 Add tests to verify the new view_build_status_v2 is used by the view_builder and can be read from all nodes with the expected values. Also test a migration from the v1 layout to v2.	2024-09-05 15:42:35 +03:00
Wojciech Mitros	c1b0434c16	test: finish mv view update explicitly instead of relying on delay duration When testing mv admission control, we perform a large view update and check if the following view update can be admitted due to the high view backlog usage. We rely on a delay which keeps the backlog high for longer to make sure the backlog is still increased during the second write. However, in some test runs the delay is not long enough, causing the second write to miss the large backlog and not hit admission control. In this patch we keep the increased backlog high using another injection instead of relying on a delay to make absolute sure that the backlog is still high during the second write. Fixes scylladb/scylladb#20382 Closes scylladb/scylladb#20445	2024-09-05 15:08:04 +03:00
Lakshmi Narayanan Sreethar	7c5efab7d5	cql-pytest: add test to verify consider_only_existing_data compaction option Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-09-05 17:34:13 +05:30
Łukasz Paszkowski	20a6296309	test: Add reversed query tests on simulated upgrade process Run the reversed queries on a 2-node cluster with CL=ALL with and without NATIVE_REVERSE_QUERIES feature flag. When the flag is enabled, the native reversed format is used, otherwise the legacy format. The NATIVE_REVERSE_QUERIES feature flag is suppressed with an error injection that simulates cluster upgrade process. Backport is not required. The patch adds additional upgrade tests for https://github.com/scylladb/scylladb/pull/18864 Closes scylladb/scylladb#20179	2024-09-03 14:45:08 +03:00
Kamil Braun	292ef0d1f9	Merge 'Fix node replace with inter-dc encryption enabled.' from Gleb Natapov Currently if a coordinator and a node being replaced are in the same DC while inter-dc encryption is enabled (connections between nodes in the same DC should not be encrypted) the replace operation will fail. It fails because a coordinator uses non encrypted connection to push raft data to the new node, but the new node will not accept such connection until it knows which DC the coordinator belongs to and for that the raft data needs to be transferred. The series adds the test for this scenario and the fix for the chicken&egg problem above. The series (or at least the fix itself) needs to be backported because this is a serious regression. Fixes: scylladb/scylladb#19025 Closes scylladb/scylladb#20290 * github.com:scylladb/scylladb: topology coordinator: fix indentation after the last patch topology coordinator: do not add replacing node without a ring to topology test: add test for replace in clusters with encryption enabled test.py: add server encryption support to cluster manager .gitignore: fix pattern for resources to match only one specific directory	2024-08-30 11:29:05 +02:00
Pavel Emelyanov	cec4d207f6	Merge 'repair: throw if batchlog manager isn't initialized' from Aleksandra Martyniuk repair_service::repair_flush_hints_batchlog_handler may access batchlog manager while it is uninitialized. Throw if batchlog manager isn't initialized. Fixes: #20236. Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access. Closes scylladb/scylladb#20251 * github.com:scylladb/scylladb: test: add test to ensure repair won't fail with uninitialized bm repair: throw if batchlog manager isn't initialized	2024-08-30 11:37:24 +03:00
Gleb Natapov	17f4a151ce	topology coordinator: do not add replacing node without a ring to topology When only inter dc encryption is enabled, a non-encrypted connection between two nodes is allowed only if both nodes are in the same dc. If the node that initiates the connection knows that dst is in the same dc and hence uses a non-encrypted connection, but dst does not yet know the topology of src, such a connection will not be allowed since dst cannot guarantee that src is in the same dc. Currently, when topology coordinator is used, a replacing node will appear in the coordinator's topology immediately after it is added to group0. The coordinator will try to send a raft message to the new node and (assuming only inter dc encryption is enabled and the replacing node and the coordinator are in the same dc) it will try to open a regular, non-encrypted connection to it. But the replacing node will not have the coordinator in its topology yet (it needs to sync the raft state for that), so it will reject such a connection. To solve the problem the patch does not add a replacing node that was just added to group0 to the topology. It will be added later, when tokens are assigned to it. At this point a replacing node will already have made sure that its topology state is up-to-date (since it will execute a raft barrier in join_node_response_params handler) and it knows the coordinator's topology. This aligns replace behaviour with bootstrap since bootstrap also does not add a node without a ring to the topology. The patch effectively reverts `b8ee8911ca` Fixes: scylladb/scylladb#19025	2024-08-29 17:14:09 +03:00
Gleb Natapov	2f1b1fd45e	test: add test for replace in clusters with encryption enabled	2024-08-29 17:14:09 +03:00
Patryk Jędrzejczak	95e14ae44b	test: test zero-token nodes We add tests to verify the basic properties of zero-token nodes. `test_zero_token_nodes_no_replication` and `test_not_enough_token_owners` are more or less deterministic tests. Running them only in the dev mode is sufficient. `test_zero_token_nodes_topology_ops` is quite slow, as expected, considering parameterization and the number of topology operations. In the future we can think of making it faster or skipping in the debug mode. For now, our priority is to test zero-token nodes thoroughly.	2024-08-29 10:37:07 +02:00

251 Commits