This test was added in PR #19789 but was disabled with xfail because of
a bug in the way truncate saved the commit log replay positions. More
specifically, the replay positions for shards that had no mutations were
saved to system.truncated with shard_id == 0, regardless of which shard
they were actually saved for (see #21719).
The bug was fixed in #21722, so this change removes the xfail tag from
the test.
Closes scylladb/scylladb#21902
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows removing
a lot of (but not all) functions that work on ips from the effective
replication map.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/
Refs: scylladb/scylladb#21777
"
* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
locator: topology: remove no longer used get_all_ips()
gossiper: change get_unreachable_nodes to host ids
locator: drop no longer used ip based functions from effective replication map and friends
test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
replica/mutation_dump: use host ids instead of ips
alternator: move ttl to work with host ids instead of ips
storage_service: move node_ops code to use host ids instead of host ips
streaming: move streaming code to use host ids instead of host ips
repair: move repair code to use host ids instead of host ips
gossiper: add get_unreachable_host_ids() function
locator: topology: add more functions that return host ids to effective replication map
locator: add more functions that return host ids to effective replication map
This patch includes more tests (in Python) that I wrote while implementing the Alternator UpdateTable feature for adding a GSI to an existing table (https://github.com/scylladb/scylladb/issues/11567).
I explain each of these tests in the separate patches below, but basically they fall into two types:
1. Tests which pass with today's materialized views and Alternator GSI/LSI, and serve to ensure that whatever changes I make to the view update implementation don't break corner cases that already worked.
2. Tests for the UpdateTable feature in Alternator which doesn't work today - so they xfail - and will need to work for #11567. We already had a few tests for this, but here I add more and improve coverage of various corner cases I discovered while implementing the feature.
I already have a working prototype for #11567 which passes all these tests. Many of these tests helped expose various bugs in earlier versions of my code.
Closes scylladb/scylladb#21927
* github.com:scylladb/scylladb:
test/cqlpy: a few more functional tests for materialized views
test/alternator: more tests for UpdateTable create and delete GSI
test/alternator: make UpdateTable tests wait less
test/alternator: move UpdateTable tests to a separate file
test/alternator: add another test for elaborate GSI updates
test/alternator: test that DescribeTable returns IndexStatus for GSI
test/alternator: fix wrong test for UpdateTable metrics
test/alternator: add test for missing attribute in item in LSI
test/alternator: test that DescribeTable doesn't return IndexStatus for LSI
test/alternator: add tests for RBAC for create and delete GSI
This series attempts to get rid of flakiness in `cache_algorithm_test` by solving two problems.
Problem 1:
The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.
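For illustration, the intended construction looks like this (a Python sketch; the actual test is C++):

```python
# Sketch of the intended key construction: a one-byte counter followed by
# explicit zero padding up to the target size. KEY_SIZE is arbitrary here.
KEY_SIZE = 20

def make_key(i: int, size: int = KEY_SIZE) -> bytes:
    # Every byte is initialized, so the result is identical on every call --
    # unlike a partially initialized buffer whose tail depends on allocator reuse.
    return bytes([i]) + b"\x00" * (size - 1)

assert [make_key(i) for i in range(3)] == [make_key(i) for i in range(3)]
```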
Problem 2:
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in the background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
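As a rough sketch of that retry logic (the helper names are hypothetical stand-ins, not the actual test code):

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    misses: int  # shape of what the hypothetical read_stats() returns

def run_with_retries(run_test, read_stats, expected_misses, max_attempts=3):
    # run_test() returns True on success; read_stats() snapshots the global
    # cache stats before and after the attempt.
    for _ in range(max_attempts):
        before = read_stats()
        ok = run_test()
        after = read_stats()
        if ok:
            return
        if after.misses - before.misses > expected_misses:
            continue  # extra misses the test didn't cause: retry, don't fail
        raise AssertionError("unexpected cache misses caused by the test itself")
    raise AssertionError("test kept failing even after retries")
```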
Fixes #21536
Should be backported to prevent flaky failures in older branches.
Closes scylladb/scylladb#21948
* github.com:scylladb/scylladb:
cache_algorithm_test: harden against stats being confused by background activity
cache_algorithm_test: fix a use of an uninitialized variable
Explicitly disable tablets in a few tests that rely on features not yet supported with tablets.
Closes scylladb/scylladb#21070
* github.com:scylladb/scylladb:
test: disable tablets in test_raft_fix_broken_snapshot
test: disable tablets in test_raft_recovery_stuck
test: disable tablets in test_raft_recovery_majority_lost
test: don't run test_raft_recovery_basic with tablets
test: fix test_writes_to_previous_cdc_generations to work with tablets
test: fix topology_custom/test_mv_topology_change.py to work with tablets
test: correct replication factor in test_multidc.py
test: update test_view_build_status to work with tablets
test: fix test_change_rpc_address with tablets.
test: explicitly disable tablets in test_group0_schema_versioning
test: disable tablets in topology/test_mutation_schema_change.py
test: disable tablets in topology/test_mv.py
This patch adds a few more functional tests for the CQL materialized
view feature in cqlpy. The new tests pass, but helped me catch bugs (and
understand what is *not* a bug) while refactoring some view update code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have in test_gsi_updatetable.py several functional tests for
the Alternator feature of adding or deleting a GSI on an existing table,
through the UpdateTable operation.
This patch adds many more tests for various corner cases of this feature -
tests developed in parallel with actually implementing that feature.
All tests in test_gsi_updatetable.py pass on Amazon DynamoDB but currently
xfail on Alternator, due to the following issues:
* #11567: Alternator: allow adding a GSI to a pre-existing table
* #9424: Alternator GSIs should exclude items with empty-string key components
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The UpdateTable tests for creating and deleting a GSI need to wait for
the asynchronous operation of the view's building and deletion, using
two utility functions wait_for_gsi() and wait_for_gsi_gone().
Because I originally wrote these tests for DynamoDB and its extremely
high latency for these operations, these functions slept for a whole second
between checks. This whole-second sleep is
absurd in Alternator where building a small view takes just a fraction of
a second. So let's lower the sleep time from 1 second to 0.1 seconds,
and allow these tests to pass much faster on Alternator (once this
feature is implemented in Alternator, of course - until then all these
tests still fail immediately on an unimplemented operation).
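A sketch of what such a polling helper can look like (assuming a boto3 Table resource; names are approximate, not the exact test utilities):

```python
import time

def wait_for_gsi(table, gsi_name, timeout=600):
    # Poll DescribeTable until the GSI becomes ACTIVE, sleeping only 0.1s
    # between polls: negligible for DynamoDB's slow index builds, but it lets
    # Alternator's sub-second builds finish the wait almost immediately.
    deadline = time.time() + timeout
    while time.time() < deadline:
        desc = table.meta.client.describe_table(TableName=table.name)
        gsis = desc['Table'].get('GlobalSecondaryIndexes', [])
        if any(g['IndexName'] == gsi_name and g.get('IndexStatus') == 'ACTIVE'
               for g in gsis):
            return
        time.sleep(0.1)
    raise TimeoutError(f'GSI {gsi_name} did not become ACTIVE')
```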
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The source file test/alternator/test_gsi.py has already grown very
large, so this patch moves all the existing tests related to using
UpdateTable to add or delete a GSI to a separate file:
test_gsi_updatetable.py.
We just move tests here - no new tests or functional changes - but
we did use the opportunity for some small improvements in
the comments.
In the next patch we'll add more tests to this new file.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have a test, test/alternator/test_gsi.py::test_update_gsi_pk, which
created a GSI whose *partition key* is a regular column in the base
table, and exercised various elaborate updates requiring adding,
updating and deleting rows from the materialized view.
In this patch, we add another similar test case, just for a *clustering
key*.
Both these tests are important regression tests - when we later
reimplement GSI we'll want to verify that none of the complex update
scenarios got broken (and indeed, some broken code did break these
tests).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test reproducing issue #11471 - where DescribeTable
on a table that has an already-built GSI (created with the table itself)
must return IndexStatus == "ACTIVE".
This test passes on DynamoDB, but xfails on Alternator because of
issue #11471.
We actually had this check earlier, but it was part of a bigger xfailing
test that checked multiple features. It's better to have it as a
separate test just for this feature, as we'll soon fix this issue and
make this test pass.
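The check itself is small; roughly, in boto3 terms (the table name and index position are placeholders):

```python
import boto3

client = boto3.client('dynamodb')
# For a GSI created together with the table, DescribeTable must report the
# index as already built.
desc = client.describe_table(TableName='tbl')['Table']
assert desc['GlobalSecondaryIndexes'][0]['IndexStatus'] == 'ACTIVE'
```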
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test we had for counting Alternator operations metrics ran the
UpdateTable request without any parameters, which isn't actually a
valid call - Amazon DynamoDB rejects such a call, saying that one of
its optional parameters must be present - and we'll want to do the
same later too.
So let's fix the test to use a valid UpdateTable request, one that
just sets BillingMode='PAY_PER_REQUEST'. This is already the
current setting, so nothing really changes, but the request is still
counted as an operation in the metric.
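In boto3 terms, the fixed test issues something like ('tbl' is a placeholder):

```python
import boto3

client = boto3.client('dynamodb')
# A valid, effectively no-op UpdateTable: PAY_PER_REQUEST is already the
# table's billing mode, so nothing changes, but the request is accepted
# and counted in the operation metrics.
client.update_table(TableName='tbl', BillingMode='PAY_PER_REQUEST')
```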
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Test that when a table has an LSI, then if the indexed attribute is
missing, the item is added to the base table but not the index.
We already have exactly the same test for GSI in test_gsi.py, but forgot
to write the same test for LSI. It's important to test this scenario
separately for GSIs and LSIs because in an upcoming GSI reimplementation
we plan to make the GSI and LSI implementation slightly different, and
they can have separate bugs (and in fact, we had such an LSI-specific
bug in one broken implementation).
The same scenario is also covered by the test
test_streams.py::test_streams_updateitem_old_image_lsi_missing_column,
but that is an Alternator Streams test, and we should have a more basic
test for this scenario in test_lsi.py.
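A hedged sketch of the scenario (assuming a boto3 Table whose base keys are 'p'/'c' and whose LSI 'my_lsi' sorts on 'b'; these names are illustrative):

```python
from boto3.dynamodb.conditions import Key

def check_lsi_skips_missing_attribute(table):
    # 'b' (the LSI sort key) is deliberately absent from the item.
    table.put_item(Item={'p': 'dog', 'c': 0})
    # The item is readable from the base table...
    assert 'Item' in table.get_item(Key={'p': 'dog', 'c': 0})
    # ...but a query on the LSI must not return it.
    resp = table.query(IndexName='my_lsi',
                       KeyConditionExpression=Key('p').eq('dog'))
    assert resp['Count'] == 0
```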
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Whereas GSIs have an IndexStatus when described by DescribeTable,
LSIs do not. The purpose of IndexStatus is to tell when the index is live,
and this is not needed for LSIs because they cannot be added to a base
table that already exists.
We already had a test for this, but it was hidden in an xfailing test
for many different DescribeTable attributes - so let's move it into its
own, *passing*, test. The new test passes on both Alternator and
Amazon DynamoDB.
This test is an important regression test for when we later add
IndexStatus support to GSI: it will ensure that we don't
accidentally introduce IndexStatus to LSIs as well - DynamoDB doesn't
generate it for LSIs so neither should Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In later patches we will implement (as requested in issue #11567) the
UpdateTable operation for creating a new GSI or removing a GSI on an
existing table. In this patch we add to test/alternator/test_cql_rbac.py
tests to exhaustively check that the new operations will behave as expected
in respect to role-based access control (RBAC):
1. UpdateTable requires the ALTER permission on the affected table -
as was already the case before (and was documented in compatibility.md).
This should also be true for the newly-implemented UpdateTable
operations that create a GSI and delete a GSI, and we test that.
The above statement may sound counter-intuitive - why does creating
or deleting a GSI require ALTER permissions (on the base table), not
CREATE or DROP permissions? But this makes sense when you consider
that CREATE permissions should allow you to create new independent tables,
not to change the behavior or performance of existing tables (which
adding a GSI does).
2. When a role has permissions to create a GSI, it should be able to
read the new GSI (SELECT permissions). This is known as "auto-grant".
3. When a GSI is deleted, whatever permissions were set on it are revoked,
so that if it's later recreated, the old permissions don't resurface.
This is known as "auto-revoke".
Because the UpdateTable feature for creating and deleting a GSI is not
yet enabled, the new tests are all marked "xfail".
The new tests, like all tests in the file test/alternator/test_cql_rbac.py
are Scylla-only and are skipped on Amazon DynamoDB - because they test
the Scylla-only CQL-based role-based access control API.
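For a flavor of the first check, roughly (the client setup is done by hypothetical fixtures; the error matching is approximate):

```python
import pytest
from botocore.exceptions import ClientError

def check_create_gsi_needs_alter(client, table_name):
    # 'client' is authenticated as a role WITHOUT ALTER permission on the
    # table; creating a GSI via UpdateTable must then be rejected.
    with pytest.raises(ClientError, match='AccessDenied'):
        client.update_table(
            TableName=table_name,
            AttributeDefinitions=[{'AttributeName': 'x', 'AttributeType': 'S'}],
            GlobalSecondaryIndexUpdates=[{'Create': {
                'IndexName': 'gsi',
                'KeySchema': [{'AttributeName': 'x', 'KeyType': 'HASH'}],
                'Projection': {'ProjectionType': 'ALL'}}}])
```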
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In tablets mode, it is not allowed to CREATE a table
if the replication factor cannot be satisfied. E.g. if the keyspace
is defined to have replication_factor = 3 and there
are only 2 nodes, in vnodes mode one can still
CREATE the table and write to it, whereas in tablets
mode one gets an error.
The confusion is about what 'replication_factor' means.
When NetworkTopologyStrategy is used, in multi-dc mode, each DC must
have at least 'replication_factor' replicas and stores
'replication_factor' copies of data.
The test author (as well as the author of this "fix", see
my confused report of gh-21166) assumed that 'replication_factor'
means the total number of replicas, not the number of replicas
per DC.
Correct the test to use only one replica per DC, as this is the
topology the test is working with. The test is not specific
to the number of replicas, so the change does not impact
the logic of the test.
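For reference, with NetworkTopologyStrategy the factor is specified (and counted) per DC, e.g. (DC names are placeholders):

```python
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()
# One replica per DC: a two-DC cluster with one node in each DC satisfies
# this, while {'replication_factor': 2} would demand two replicas in *each* DC.
session.execute("""
    CREATE KEYSPACE ks WITH replication = {
        'class': 'NetworkTopologyStrategy', 'dc1': 1, 'dc2': 1}
""")
```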
With tablets, it's not allowed to create a table in a keyspace
whose replication factor exceeds the actual number of nodes in the
cluster.
Pass the replication factor to the random_tables fixture so that
a keyspace with a correct replication_factor is created.
The test file contains two test cases, which both test
materialized view tombstone gc settings. With tablets the default
is "repair" which is different from vnodes.
The tests verify that the gc settings are not inherited. With
tablets, the gc settings are forced, which is indistinguishable from
inheriting, so the tests fail when run with tablets.
When investigating issue #21724, the docstring for
`test_recover_stuck_raft_recovery` was found to be difficult to follow.
Restructured the docstring into an ordered list to:
1. Improve readability
2. Clearly outline the test steps
3. Make the test's logic and flow more immediately comprehensible
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21728
The /column_family/compaction_strategy endpoint has GET and POST implemented;
the latter changes the strategy on the table.
An unknown strategy name implicitly renders an internal server error code,
because compaction_strategy::type() throws an exception when it tries to
convert the strategy name string to the strategy enum class type.
This is to finish validation of #21533
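A quick way to exercise the validation (treating the exact path and parameter name as assumptions about Scylla's REST API on the default port):

```python
import requests

# Before the fix, an unknown strategy name surfaced as an internal server
# error (500) via the exception in compaction_strategy::type(); with explicit
# validation, a client error is expected instead.
r = requests.post(
    'http://localhost:10000/column_family/compaction_strategy/ks:cf',
    params={'class_name': 'NoSuchStrategy'})
assert r.status_code == 400, f'expected client error, got {r.status_code}'
```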
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#21569
This pull request is a continuation of scylladb/scylladb#20688 - the contents of the main commit are the same, the only change is the additional commit with a test.
Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of the view-update backlog reached the maximum (10% of memory), we delayed every request by an additional second - while smaller amounts of backlog resulted in smaller delays.
This hard-coded maximum delay of one second was considered *huge* - it slows down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one-second latency for writes didn't have any noticeable effect on slowing down the client.
So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before.
Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it.
The new parameter's help string mentions both these use cases of the parameter.
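The shape of the mechanism, as a simplified sketch (not Scylla's exact formula):

```python
def view_update_delay_ms(backlog: int, max_backlog: int,
                         delay_limit_ms: int = 1000) -> float:
    # The added write delay grows with the view-update backlog, up to the
    # configurable limit; the full backlog gets the full delay_limit.
    if delay_limit_ms == 0:
        return 0.0  # flow control adds no delay, effectively disabled
    return delay_limit_ms * min(backlog / max_backlog, 1.0)
```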
Fixes #18187
This is new functionality, no need to backport to any open source release.
Closes scylladb/scylladb#21647
* github.com:scylladb/scylladb:
materialized views: test for the MV delay configuration parameter
service: add injection for skipping view update backlog
materialized view: make flow-control maximum delay configurable
There is an assumption that every destroyed compaction_group will be stopped first.
Otherwise, the group is still referenced by the compaction manager, which can use
it after it is freed. That's what happened in issue #21867 in the context of merge.
The issue is pre-existing but was made more likely by merge.
One problem is a race between split and cleanup: if split is emitted while
cleanup is stopping groups, it can happen that split preparation adds new groups
that will never be closed, since cleanup is already past the group-stopping step.
Another problem found is that the split completion handler does not account for
the possible existence of merging groups if split happens right after merge. The
split completion handler should stop all empty groups that previously had data
split from them.
The problems are fixed by guaranteeing that new groups will not be added for a
tablet being migrated away, and that empty groups are properly closed when
handling split completion.
A reproducer was added.
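Conceptually, the guarantee looks like this (a Python sketch; the real code is C++ in the compaction-group machinery):

```python
class TabletStorageGroups:
    # Conceptual sketch: once cleanup decides the tablet is migrating away
    # and starts stopping groups, no new group may be added, so none can
    # leak past the group-stopping step.
    def __init__(self):
        self.groups = []
        self.migrating_away = False

    def add_group_for_split(self, group):
        if self.migrating_away:
            raise RuntimeError('tablet is being migrated away, refusing new group')
        self.groups.append(group)

    async def cleanup(self):
        self.migrating_away = True  # set *before* stopping, closing the race
        for group in self.groups:
            await group.stop()      # stop (and deregister) every group
        self.groups.clear()
```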
Fixes #21867.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#21920
Remove an unnecessary _mark_dirty call.
server_broken_event stops the whole file execution
(preventing the next tests from running because
the Python server object is broken, see PR: scylladb/scylladb#18236),
and the next file execution will create its own new cluster,
so _mark_dirty will not change anything.
Closes scylladb/scylladb#21429
In this change, tablet_virtual_task starts supporting tablet
migration, in addition to tablet repair. Both tablet operations
reuse the same virtual_task because their task data is retrieved
similarly. However, it changes nothing from the task manager
API users' perspective. They can list running migrations or check
their statuses all the same as if migration had its own virtual_task.
Users can see running migration tasks - finished tasks are not
presented by the task manager API. However, the result
of the migration (whether it succeeded or failed) is
presented to users if they use the wait API.
If a migration was reverted, it will appear to users as failed.
We assume that the migration was reverted when its destination
does not contain a tablet replica.
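As a small sketch of that heuristic (the data shapes are made up for illustration):

```python
def migration_result(tablet_replicas: set, destination) -> str:
    # If the destination (host, shard) holds no replica once the operation
    # is over, we assume the migration was reverted and report it as failed.
    return 'done' if destination in tablet_replicas else 'failed'

assert migration_result({('h1', 0), ('h2', 3)}, ('h2', 3)) == 'done'
assert migration_result({('h1', 0)}, ('h2', 3)) == 'failed'
```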
Fixes: https://github.com/scylladb/scylladb/issues/21365.
No backport, new feature
Closes scylladb/scylladb#21729
* github.com:scylladb/scylladb:
test: boost: check migration_task_info in tablet_test.cc
replica: add repair related fields to tablet_map_to_mutation
test: add tests to check the failed migration virtual tasks
test: add tests to check the list of migration virtual tasks
test: add tests to check migration virtual tasks status
test: topology_tasks: generalize repair task functions
service: extend tablet_virtual_task::abort
service: extend tablet_virtual_task::wait
service: extend tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: tasks: make get_table_id a method of virtual_task_hint
service: tasks: extend virtual_task_hint
replica: service: add migration_task_info column to system.tablets
locator: extend tablet_task_info to cover migration tasks
locator: rename tablet_task_info methods
This change goes through locator::topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in an unordered_set; in
those cases the node is wrapped in std::reference_wrapper.
Fixes scylladb/scylladb#20357
Closes scylladb/scylladb#21863
Reads which need the sstable index were computing
column_values_fixed_lengths each time. This showed up in a perf profile
for an sstable-read-heavy workload, and amounted to about 1-2% of time.
Computing it involves type name parsing.
Avoid this by using a cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented to
arm it with the sstable's schema.
Also, fixes a potential use-after-free on schema in column_translation.
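The caching idea, in a simplified Python sketch (the real cache is sstables::column_translation in C++; compute_fixed_lengths and the schema attributes stand in for the expensive type-name parsing):

```python
def compute_fixed_lengths(schema):
    # Stand-in for the expensive step (type-name parsing in the real code).
    return [col.fixed_length for col in schema.pk_columns]

class ColumnTranslation:
    def __init__(self):
        self._schema_id = None
        self._fixed_lengths = None

    def fixed_lengths(self, schema):
        # Recompute only when a different schema shows up; for primary key
        # columns any schema yields the same answer, so the last one suffices.
        if self._schema_id != schema.id:
            self._fixed_lengths = compute_fixed_lengths(schema)
            self._schema_id = schema.id
        return self._fixed_lengths
```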
Closes scylladb/scylladb#21347
* github.com:scylladb/scylladb:
sstables: Fix potential use-after-free on column_translation::column_info::name
sstables: Avoid computing column_values_fixed_lengths on each read
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.
The cache update depends on the service level data accessor being set
and the auth service being initialized. Before this commit, it could happen
that a cache update was not triggered after the initialization. The commit
adds an explicit call to update the cache where it is guaranteed to be ready.
Fixes scylladb/scylladb#21763
Closes scylladb/scylladb#21773
Test cases related to #21826:
1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
remove a node with another node down in the datacenter, leaving
no normal token owners in that dc (reproducing #21826).
Removenode is expected to fail in this case since it
would have no place to rebuild the removed node's replicas,
yet it currently succeeds unexpectedly.
2. test_remove_failure_then_replace: verify that removenode
fails as expected when there are not enough nodes to
rebuild its replicas on, with and without additional zero-token nodes.
3. test_replace_with_no_normal_token_owners_in_dc: verify that
nodes can be replaced in a datacenter that has no live
token owners, with and without additional zero-token nodes.
Tablet replace uses all replicas to rebuild the lost replicas
and therefore should succeed in the edge case.
The restored data is verified as well.
Refs #21826
* New tests, no backport needed
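In the style of test.pylib topology tests, the first case is roughly (the ManagerClient method names and signatures are approximate, not the exact API):

```python
import pytest

async def test_remove_with_no_token_owners_in_dc(manager):
    dc1 = await manager.servers_add(1, property_file={'dc': 'dc1', 'rack': 'r1'})
    dc2 = await manager.servers_add(2, property_file={'dc': 'dc2', 'rack': 'r1'})
    # One dc2 node is already down, and the other is stopped so it can be
    # removed; removing it leaves dc2 with no normal token owners, so
    # removenode has nowhere to rebuild the removed node's replicas
    # and must fail.
    await manager.server_stop(dc2[0].server_id)
    await manager.server_stop(dc2[1].server_id)
    with pytest.raises(Exception):
        await manager.remove_node(dc1[0].server_id, dc2[1].server_id)
```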
Closes scylladb/scylladb#21827
* github.com:scylladb/scylladb:
topology_custom/test_tablets: add remove/replace tests for edge cases
test: pylib: _cluster_remove_node: log message on successful paths
test: pylib: _cluster_remove_node: mark server as removed only when removenode succeeded
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently truncating a table works by issuing an RPC to all the nodes, which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.
It works with tablets, but is not safe. A concurrent replication process may bring back old data.
This change makes TRUNCATE TABLE a topology operation, so that it is mutually exclusive with other processes in the system that could interfere with it. More specifically, it makes TRUNCATE a global topology request.
Backporting is not needed.
Fixes #16411
Closes scylladb/scylladb#19789
* github.com:scylladb/scylladb:
docs: topology-over-raft: Document truncate_table request
storage_proxy: fix indentation and remove empty catch/rethrow
test: add tests for truncate with tablets
storage_proxy: use new TRUNCATE for tablets
truncate: make TRUNCATE a global topology operation
storage_service: move logic of wait_for_topology_request_completion()
RPC: add truncate_with_tablets RPC with frozen_topology_guard
feature_service: added cluster feature for system.topology schema change
system.topology_requests: change schema
storage_proxy: propagate group0 client and TSM dependency
When the tablet scheduler drains nodes, it chooses the target location based
on a "badness" metric. Nodes with the lowest score are preferred. Before the
patch, the score which was used was the number of tablets on that node
post-movement. This way we populate the least-loaded node first. But this
works only if nodes have an equal number of shards. If nodes have different
capacities, then the number of tablets is not a good metric, because we don't
aim to equalize the per-node count, but the per-shard count. We assume that
each shard has equal capacity.
Because of this bug, during decommission, the nodes with fewer shards
would be preferred to receive replicas, which may lead to overloading
of those nodes. This imbalance would be later fixed by the normal load
balancing logic, but it's still problematic.
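The gist of the change, sketched:

```python
def pick_drain_target(candidates):
    # candidates: (node, post_move_tablet_count, shard_count). Score by
    # tablets per *shard*, not per node, so a small node with few shards
    # isn't mistaken for the least-loaded target.
    return min(candidates, key=lambda c: c[1] / c[2])

# 64 tablets on 32 shards (2/shard) beats 20 tablets on 4 shards (5/shard),
# even though the second node holds fewer tablets in total.
assert pick_drain_target([('big', 64, 32), ('small', 20, 4)])[0] == 'big'
```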
Fixes #21783
Closes scylladb/scylladb#21860