scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 12:17:02 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	eef798d84f	Merge 'Distribute data evenly among primary replicas during restore' from Robert Bindar Most likely `817fdad` uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope. This created situations where, for instance, restoring into a 9-node cluster with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step. This PR fixes this by changing the formula for picking the primary replica out of a set of eligible replicas from within the passed scope. The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts. Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just split it and did touch-ups here and there. Fixes #27281 Closes scylladb/scylladb#27397 * github.com:scylladb/scylladb: test: reduce dataset and number of test cases or debug builds test: bump repair timeout up, it's sometimes not enough in CI test: refactor test_refresh.py to match test_restore_with_streaming_scopes. 
test: extend test_restore_with_streaming_scopes test: Adjust test_restore_primary_replica_different_dc_scope_all test: Refactor restoring code in test_backup to match SM pattern test: add check_mutation_replicas calls after fresh creation of dataset test: extend create_dataset to accept consistency_level test: refactor check_mutation_replicas so it's more readable test: make create_dataset async and refactor so it's configurable test: use defaultdict in collect_mutations test: add log marks to facilitate reusing server for restore locator: tablets: Distribute data evenly among primary replicas during restore	2026-01-14 18:57:55 +01:00
Robert Bindar	d88036db48	locator: tablets: Distribute data evenly among primary replicas during restore Most likely `817fdad` uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope. This created situations where, for instance, restoring into a 9-node cluster with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step. split from bhalevy/load-balance-primary-replica Fixes #27281 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:44:20 +02:00
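The commit message above does not spell out the new formula. As a minimal sketch of the general idea only (rotating primaryship among the eligible replicas so that no subset of nodes receives all the data), one could index the replica list by tablet number. The names `host_id`, `pick_primary` and `count_primaries` are hypothetical and are not taken from the ScyllaDB source; the sketch also assumes the eligible list is already filtered by scope and ordered consistently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a replica's host identifier.
using host_id = std::string;

// Rotate the primary over the eligible replicas using the tablet index.
// Assumption: `eligible` is already scope-filtered and consistently ordered.
inline const host_id& pick_primary(const std::vector<host_id>& eligible,
                                   std::size_t tablet_index) {
    assert(!eligible.empty());
    return eligible[tablet_index % eligible.size()];
}

// Count how many tablets each replica would receive as primary.
inline std::map<host_id, std::size_t> count_primaries(
        const std::vector<host_id>& eligible, std::size_t tablet_count) {
    std::map<host_id, std::size_t> counts;
    for (std::size_t t = 0; t < tablet_count; ++t) {
        ++counts[pick_primary(eligible, t)];
    }
    return counts;
}
```

With three eligible replicas and nine tablets, this rotation assigns three tablets to each replica instead of concentrating them on one.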
Botond Dénes	04b8f72946	Merge 'repair: Implement auto repair for tablet repair' from Asias He repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99 New feature. No backport. Closes scylladb/scylladb#27534 * github.com:scylladb/scylladb: topology_coordinator: Add metrics for tablet repair repair: Implement auto repair for tablet repair	2026-01-12 14:16:01 +02:00
Asias He	7ba7b25bdd	repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99	2026-01-09 16:11:39 +08:00
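The time-based eligibility rule described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the struct `auto_repair_config` and the function `needs_auto_repair` are hypothetical names, and only the two option names in the comments come from the commit message.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical holder for the two defaults named in the commit message.
struct auto_repair_config {
    bool enabled_default = false;               // auto_repair_enabled_default
    std::chrono::seconds threshold_default{0};  // auto_repair_threshold_default_in_seconds
};

// A tablet is eligible when auto repair is enabled and the time since its
// last repair is bigger than the configured threshold.
inline bool needs_auto_repair(const auto_repair_config& cfg,
                              std::chrono::seconds now,
                              std::chrono::seconds last_repair) {
    assert(now >= last_repair);
    return cfg.enabled_default && (now - last_repair) > cfg.threshold_default;
}
```

Counting the tablets for which such a predicate holds would correspond to what the `auto_repair_needs_repair_nr` metric is described as reporting.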
Botond Dénes	60570d7114	Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations: * Altering a keyspace's RF if it would make the keyspace RF-rack-invalid * Adding a node in a new rack * Removing / Decommissioning the last node in a rack Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views. The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid. Fixes scylladb/scylladb#23345 Fixes https://github.com/scylladb/scylladb/issues/26820 backport to relevant versions for materialized views with tablets since it depends on rf-rack validity Closes scylladb/scylladb#26354 * github.com:scylladb/scylladb: docs: update RF-rack restrictions cql3: don't apply RF-rack restrictions on vector indexes cql3: add warning when creating mv/index with tablets about rf-rack service/tablet_allocator: always allow tablet merge of tables with views locator: extend rf-rack validation for rack lists test: test rf-rack validity when creating keyspace during node ops locator: fix rf-rack validation during node join/remove test: test topology restrictions for views with tablets test: add test_topology_ops_with_rf_rack_valid topology coordinator: restrict node join/remove to preserve RF-rack validity topology coordinator: add validation to node remove locator: extend rf-rack validation functions view: change validate_view_keyspace to allow MVs if RF=Racks db: enforce rf-rack-validity for keyspaces with views replica/db: add enforce_rf_rack_validity_for_keyspace helper db: remove enforce parameter from check_rf_rack_validity test: adjust test 
to not break rf-rack validity	2026-01-09 10:01:23 +02:00
Łukasz Paszkowski	62313a6264	load_sketch: Allow populating load_sketch with normalized current load Currently, tablet allocation intentionally ignores current load ( introduced by the commit #1e407ab) which could cause identical shard selection when allocating a small number of tablets in the same topology. When a tablet allocator is asked to allocate N tablets (where N is smaller than the number of shards on a node), it selects the first N lowest shards. If multiple such tables are created, each allocator run picks the same shards, leading to tablet imbalance across shards. This change initializes the load sketch with the current shard load, scaled into the [0,1] range, ensuring allocation still remains even while starting from globally least-loaded shards. Fixes https://github.com/scylladb/scylladb/issues/27620 Closes scylladb/scylladb#27802	2026-01-07 11:49:01 +01:00
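A minimal sketch of the scaling step described above, assuming the simplest possible normalization (dividing each shard's load by the maximum). The commit message only says the load is scaled into the [0,1] range, so the exact scaling used by `load_sketch` may differ; `normalize_load` and `least_loaded_shard` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scale per-shard loads into [0,1] by dividing by the maximum load.
// Assumption: this max-based scaling is illustrative only.
inline std::vector<double> normalize_load(const std::vector<std::uint64_t>& shard_load) {
    assert(!shard_load.empty());
    const std::uint64_t max_load =
        *std::max_element(shard_load.begin(), shard_load.end());
    std::vector<double> out(shard_load.size(), 0.0);
    if (max_load == 0) {
        return out;  // all shards are empty, so every normalized load is 0
    }
    for (std::size_t i = 0; i < shard_load.size(); ++i) {
        out[i] = static_cast<double>(shard_load[i]) / static_cast<double>(max_load);
    }
    return out;
}

// Index of the least-loaded shard; the lowest index wins ties.
inline std::size_t least_loaded_shard(const std::vector<double>& load) {
    assert(!load.empty());
    return static_cast<std::size_t>(
        std::min_element(load.begin(), load.end()) - load.begin());
}
```

Starting from these normalized values rather than from zero means a new allocation begins on a shard that is currently lightly loaded, instead of always on the first shards.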
Ferenc Szili	621cb19045	load_sketch: use tablet sizes in load computation This commit changes load_sketch so that it computes node and shard load based on tablet sizes instead of tablet count.	2025-12-27 10:37:23 +01:00
Ferenc Szili	1c9ec9a76d	load_stats: add get_tablet_size_in_transition() This patch adds a method to load_stats which searches for the tablet size during tablet transition. In case of tablet migration, the tablet will be searched on the leaving replica, and during rebuild we will return the average tablet size of the pending replicas.	2025-12-27 10:37:23 +01:00
Michael Litvak	07d85af433	locator: extend rf-rack validation for rack lists Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to validate rack-list-based replication as well. Previously, validation was done only for numeric replication. If the replication is based on a rack list, we validate that all racks that are required for replication are present in the topology rack map. If some rack is needed for replication but is missing, or it doesn't have normal token owner nodes, the validation fails with an error.	2025-12-22 09:14:30 +01:00
Michael Litvak	a738905a4b	locator: fix rf-rack validation during node join/remove If a keyspace is created while a node is joining or being removed, it could break the rf-rack invariant. For example: 1. We have 3 nodes in 3 racks, no keyspaces 2. A new node starts to join in a new rack - passes validation because there are no keyspaces 3. Create a keyspace with rf=3 - passes validation because the joining node is not a normal token owner yet 4. The new node becomes a normal token owner 5. The rf-rack invariant is broken. We have rf=3 and 4 racks To fix this, we change the rf-rack check to consider a node as a token owner if it's either a normal token owner or it has bootstrap tokens and is about to become a normal token owner. Now the condition can't be broken. Consider keyspace creation at different stages of adding a node in our example: * Before the node is assigned bootstrap tokens: the node is not considered. We can create a keyspace with rf=3 as if the node doesn't exist, and then node join will fail in the group0 operation that assigns bootstrap tokens, because during this operation we check rf-rack validity. * Assigning bootstrap tokens is a single group0 operation that is serialized with keyspace creation. During this operation we check that adding the node as a token owner will maintain rf-rack validity for all keyspaces. * After the node is assigned bootstrap tokens and until it becomes a normal token owner: it is considered as a transitioning token owner by the rf-rack check and the rack is considered a transitioning rack. We can't count the rack as a normal rack because the node join may still fail and rollback. Trying to create a keyspace with either rf=3 or rf=4 will fail because we can end up with either 3 or 4 racks. 
Similarly, when removing a node, we validate that removing the node will maintain rf-rack validity in the same group0 operation that changes the node state to removing/decommissioning, after which the node becomes a leaving endpoint, and it's not considered a normal token owner anymore for the rf-rack check.	2025-12-22 09:14:30 +01:00
Michael Litvak	9e1f78d162	locator: extend rf-rack validation functions Extend the locator function assert_rf_rack_valid_keyspace to accept arbitrary topology dc-rack maps and nodes instead of using the current token metadata. This allows us to add a new variant of the function that checks rf-rack validity given a topology change that we want to apply. We will use it to check that rf-rack validity will be maintained before applying the topology change. The possible topology changes for the check are node add and node remove / decommission. These operations can change the number of normal racks - if a new node is added to a new rack, or the last node is removed from a rack.	2025-12-22 09:14:29 +01:00
Michael Litvak	de1bb84fca	db: enforce rf-rack-validity for keyspaces with views Extend the RF-rack-validity enforcement to keyspaces that have views, regardless of the option `rf_rack_valid_keyspaces`. Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces` was set for all keyspaces. Now we want to allow creating MVs in tablet keyspaces that are RF-rack-valid and enforce the RF-rack-validity even if the config option is not set.	2025-12-22 09:13:49 +01:00
Dawid Mędrek	1e14c08eee	locator/token_metadata: Remove get_host_id() The function is declared, but it's not defined or used anywhere. Closes scylladb/scylladb#27374	2025-12-15 10:36:52 +01:00
Benny Halevy	c8cff94a5a	api: storage_service/tablets/repair: disable incremental repair by default Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:21 +02:00
Tomasz Grabiec	d6c14de380	Merge 'locator/node: include _excluded in missing places' from Patryk Jędrzejczak We currently ignore the `_excluded` field in `node::clone()` and the verbose formatter of `locator::node`. The first one is a bug that can have unpredictable consequences on the system. The second one can be a minor inconvenience during debugging. We fix both places in this PR. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72 This PR is a bugfix that should be backported to all supported branches. Closes scylladb/scylladb#27265 * github.com:scylladb/scylladb: locator/node: include _excluded in verbose formatter locator/node: preserve _excluded in clone()	2025-11-26 18:29:59 +01:00
Patryk Jędrzejczak	287c9eea65	locator/node: include _excluded in verbose formatter It can be helpful during debugging.	2025-11-26 13:26:17 +01:00
Patryk Jędrzejczak	4160ae94c1	locator/node: preserve _excluded in clone() We currently ignore the `_excluded` field in `clone()`. Losing information about exclusion can have unpredictable consequences. One observed effect (that led to finding this issue) is that the `/storage_service/nodes/excluded` API endpoint sometimes misses excluded nodes.	2025-11-26 13:26:11 +01:00
Patryk Jędrzejczak	cc273e867d	Merge 'fix notification about expiring erm held for too long' from Gleb Natapov Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should. Fixes https://github.com/scylladb/scylladb/issues/27141 Closes scylladb/scylladb#27140 * https://github.com/scylladb/scylladb: test: test that expired erm that held for too long triggers notification token_metadata: fix notification about expiring erm held for too long	2025-11-26 12:59:00 +01:00
Gleb Natapov	9f97c376f1	token_metadata: fix notification about expiring erm held for too long Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor.	2025-11-25 13:35:24 +02:00
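The class of bug described above can be illustrated with a small stand-alone sketch: a tracker whose destructor fires a notification, and a move-assignment operator that must run the same release logic for the state it overwrites. The `tracker` class below is hypothetical and much simpler than the real ERM tracker in `token_metadata`.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical tracker: the notification runs when the tracked state is released.
class tracker {
    std::function<void()> _on_release;

    // Fire the notification at most once, then clear it.
    void release() {
        if (_on_release) {
            auto f = std::exchange(_on_release, nullptr);
            f();
        }
    }

public:
    tracker() = default;
    explicit tracker(std::function<void()> f) : _on_release(std::move(f)) {}
    tracker(tracker&& o) : _on_release(std::exchange(o._on_release, nullptr)) {}

    tracker& operator=(tracker&& o) {
        if (this != &o) {
            release();  // do what the destructor would do for the old state
            _on_release = std::exchange(o._on_release, nullptr);
        }
        return *this;
    }

    ~tracker() { release(); }
};
```

If `operator=` simply overwrote `_on_release` without calling `release()`, the notification for the previously tracked state would be lost silently, which matches the symptom the commit describes.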
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Asias He	d51b1fea94	tablets: Allow tablet merge when repair tasks exist Currently we do not allow tablet merge if either of the tablets contains a tablet repair request. This could block the tablet merge for a very long time if the repair requests could not be scheduled and executed. We can actually merge the repair tasks in most of the cases. This is because most of the time all tablets are requested to be repaired by a single API request, so they share the same task_id, request_type and other parameters. We can merge the repair task info and execute the repair after the merge. If they do not share the task info, we cannot merge and have to wait for the repair before the merge, which is both rare and ok. Another case is that one of the tablets has a repair task info (t1) while the other tablet (t2) does not; it is possible that t2 has finished repair by the same repair request or that t2 is not requested to be repaired at all. We allow merge in this case too to avoid blocking the tablet merge, at the price of repairing a bit more. Fixes #26844 Closes scylladb/scylladb#26922	2025-11-20 16:01:23 +01:00
Pavel Emelyanov	f47f2db710	Merge 'Support local primary-replica-only for native restore' from Robert Bindar This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with: - `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only - `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only. - `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself) - `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense. The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope. Fixes #26584 Closes scylladb/scylladb#26609 * github.com:scylladb/scylladb: Add cluster tests for checking scoped primary_replica_only streaming Improve choice distribution for primary replica Refactor cluster/object_store/test_backup nodetool restore: add primary-replica-only option nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only Enable scoped primary replica only streaming Support primary_replica_only for native restore API	2025-11-13 12:11:18 +03:00
Tomasz Grabiec	10b893dc27	Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili `topology_coordinator::migrate_tablet_size()` was introduced in `10f07fb95a`. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search: ``` if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) { if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) { ``` This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats. This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method. A version containing this bug has not been released yet, so no backport is needed. Closes scylladb/scylladb#26946 * github.com:scylladb/scylladb: load_stats: add test for migrate_tablet_size() load_stats: fix problem with tablet size migration	2025-11-12 23:48:37 +01:00
Ferenc Szili	b77ea1b8e1	load_stats: fix problem with tablet size migration This patch fixes a bug with tablet size migration in load_stats. has_tablet_size() lambda in topology_coordinator::migrate_tablet_size() was returning false in all cases due to incorrect search iterator comparison after a table and tablet search. This change moves load_stats migrate_tablet_sizes() functionality into a separate method of load_stats.	2025-11-11 14:26:09 +01:00
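The snippet quoted in the merge description compares the result of `find()` with a second, identical `find()`, so the "found" branch can never be taken. A minimal sketch of the intended check compares against `end()` instead. The types here are simplified stand-ins (a table keyed by name, a tablet keyed by a pair of tokens) and are assumptions, not the real `load_stats` types; the early-return form is a stylistic choice of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Simplified stand-ins (assumptions): table id as a string, tablet range as a token pair.
using token_range = std::pair<std::int64_t, std::int64_t>;
using tablet_sizes_map =
    std::map<std::string, std::map<token_range, std::uint64_t>>;

// Returns true only if both the table and the tablet range are present.
inline bool has_tablet_size(const tablet_sizes_map& tables,
                            const std::string& table,
                            const token_range& trange) {
    auto table_i = tables.find(table);
    if (table_i == tables.end()) {  // compare with end(), not with another find()
        return false;
    }
    return table_i->second.find(trange) != table_i->second.end();
}
```

Because a missing table is handled before the inner lookup, the sketch also avoids the kind of unchecked access that could surface as `std::out_of_range` when an entry is absent.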
Benny Halevy	a290505239	utils: stall_free: add dispose_gently dispose_gently consumes the object moved to it, clearing it gently before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26356	2025-11-11 12:20:18 +02:00
Robert Bindar	817fdadd49	Improve choice distribution for primary replica I noticed during tests that `maybe_get_primary_replica` would not distribute uniformly the choice of primary replica because `info.replicas` on some shards would have an order whilst on others it'd be ordered differently, thus making the function choose a node as primary replica multiple times when it clearly could've chosen a different node. This patch sorts the replica set before passing it through the scope filter. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Avi Kivity	d458dd41c6	Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and then move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes a few places in Scylla that still use one. Adopting newer seastar API, no need to backport Closes scylladb/scylladb#26747 * github.com:scylladb/scylladb: commitlog: Remove unused work::r stream variable ec2_snitch: Fix indentation after previous patch ec2_snitch: Coroutinize the aws_api_call_once() sstable: Construct output_stream for data instantly test: Don't reuse on-stack input stream	2025-10-31 21:22:41 +02:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00
Tomasz Grabiec	28f6bdc99b	cql3: ks_prop_defs: Expand numeric RF to rack list Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for new DCs specified in the statement. Doesn't auto-expand existing options, as the rack choice may not be in line with current replica placement. This requires co-locating tablet replicas, and tracking of co-location state, which is not implemented yet. Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-29 23:32:59 +01:00
Tomasz Grabiec	35166809cb	locator: Move rack_list to topology.hh So that we can use it in locator/tablets.hh and avoid circular dependency between that header and abstract_replication_strategy.hh	2025-10-29 23:32:58 +01:00
Pavel Emelyanov	92462e502f	ec2_snitch: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:31:08 +03:00
Pavel Emelyanov	7640ade04d	ec2_snitch: Coroutinize the aws_api_call_once() The method connects a socket, grabs in/out streams from it then writes HTTP request and reads+parses the response. For that it uses class variables for socket and streams, but there's no real need for that -- all three actually exist throughout the method "lifetime". To fix it, coroutinize the method. The same could be achieved by moving the connected socket and streams into do_with() context, but coroutine is better than that. (indentation is left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:29:25 +03:00
Ferenc Szili	10f07fb95a	load_balancer: load_stats reconcile after tablet migration and table resize This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to issue migrations which improve load balance.	2025-10-28 12:12:09 +01:00
Aleksandra Martyniuk	910cd0918b	locator: use get_primary_replica for get_primary_endpoints Currently, tablet_sstable_streamer::get_primary_endpoints is out of sync with tablet_map::get_primary_replica. The get_primary_replica optimizes the choice of the replica so that the work is fairly distributed among nodes. Meanwhile, get_primary_endpoints always chooses the first replica. Use get_primary_replica for get_primary_endpoints. Fixes: https://github.com/scylladb/scylladb/issues/21883. Closes scylladb/scylladb#26385	2025-10-28 09:56:08 +02:00
Patryk Jędrzejczak	e1c3f666c9	Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev Problems addressed by this PR * Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up. * Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations. * `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details). * Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`: * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited. * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between. * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them. * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary. 
Fixes scylladb/scylladb#26150 backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting Closes scylladb/scylladb#26315 * https://github.com/scylladb/scylladb: test_automatic_cleanup: add test_cleanup_waits_for_stale_writes test_fencing: fix due to new version increment test_automatic_cleanup: clean it up storage_proxy: wait for closing sessions in sstable cleanup fiber storage_proxy: rename await_pending_writes -> await_stale_pending_writes storage_proxy: use run_fenceable_write storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check storage_proxy: introduce run_fenceable_write storage_proxy: move update_fence_version from shared_token_metadata storage_proxy: fix start_write() operation scope in apply_locally storage_proxy: move post fence check into handle_write storage_proxy: move fencing into mutate_counter_on_leader_and_replicate storage_proxy::handle_read: add fence check before get_schema storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber sstable_cleanup_fiber: use coroutine::parallel_for_each storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock topology_coordinator: barrier before cleanup topology_coordinator: small start_cleanup refactoring global_token_metadata_barrier: add fenced flag	2025-10-27 12:35:13 +01:00
Ferenc Szili	b4ca12b39a	load_stats: change data structure which contains tablet sizes This patch changes the tablet size map in load_stats. Previously, this data structure was: std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes; and is changed into: std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes; This allows for improved performance of tablet size reconciliation.	2025-10-24 14:37:00 +02:00
Petr Gusev	c5f447224a	storage_proxy: move update_fence_version from shared_token_metadata Future commits will extend update_fence_version, and it is simpler to do so if the function resides in storage_proxy. Additionally, fence_version is the only field this function accesses, and it is used solely within storage_proxy, making this change natural on its own.	2025-10-22 16:31:43 +02:00
Petr Gusev	b23f2a2425	tablet_metadata_guard: fix split/merge handling The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the tablet_id field (_tablet), which means the guard can no longer correctly protect ongoing operations from tablet migrations. Fixes scylladb/scylladb#26437	2025-10-22 11:32:37 +02:00
Petr Gusev	ec6fba35aa	tablet_metadata_guard: add debug logs	2025-10-22 11:32:37 +02:00
Tomasz Grabiec	c4a87453a2	Merge 'Add experimental feature flag for strongly consistent tables and extend keyspace creation syntax to allow specifying consistency mode.' from Gleb Natapov The series adds an experimental flag for strongly consistent tables and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace. Closes scylladb/scylladb#26116 * github.com:scylladb/scylladb: schema: Allow configuring consistency setting for a keyspace db: experimental consistent-tablets option	2025-10-16 21:48:06 +02:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable, while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities, the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. A non-eventual setting is supported for tablets-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
Marcin Maliszkiewicz	d67632bfe2	replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge This copy is now used during the whole duration of schema merge. If it changes due to tablet_hint then it's replicated to all shards as before.	2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz	46bff28a38	db: schema_applier: move pending_token_metadata to locator It never belonged to tables and views and its placement stems from location of _tablet_hint handling code. In the follwing commits we'll reference it in storage_service.cc.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	c112916215	db: refactor new_token_metadata into pending_token_metadata It prepares pending_token_metadata to handle both new metadata and a copy of existing metadata, for consistent usage in a later commit. It also adds a shared_token_metadata getter so that we don't need to get it from db.	2025-10-14 10:56:26 +02:00
Asias He	13dd88b010	repair: Rename incremental mode name Using the name regular as the incremental mode could be confusing, since regular might be interpreted as the non-incremental repair. It is better to use incremental directly. Before: - regular (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) After: - incremental (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) Fixes #26503 Closes scylladb/scylladb#26504	2025-10-10 15:21:54 +03:00
Piotr Dulikowski	380f243986	Merge 'Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows adding replicas on the specified racks (increasing the replication factor), or removing replicas from certain racks (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. 
Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
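To illustrate how a rack-list RF can coexist with a numerical RF, the sketch below derives a replica count per datacenter from a replication options map. It is a hypothetical illustration: the function names are invented for the example, and it assumes that a rack list places one replica on each listed rack, so that the replica count equals the list length.

```python
# Hypothetical sketch: derive the number of replicas per DC from replication
# options where each value is either a numeric RF or a list of rack names.
# Assumes one replica per listed rack; names are not taken from ScyllaDB.
from typing import Dict, List, Union

ReplicationOption = Union[str, int, List[str]]


def replica_count(option: ReplicationOption) -> int:
    """Number of replicas implied by a single DC's replication option."""
    if isinstance(option, list):
        if len(set(option)) != len(option):
            raise ValueError(f"duplicate rack in rack list: {option}")
        return len(option)
    return int(option)  # old-style numerical RF, e.g. "3" or 3


def replicas_per_dc(options: Dict[str, ReplicationOption]) -> Dict[str, int]:
    """Map each datacenter name to its replica count."""
    return {dc: replica_count(opt) for dc, opt in options.items()}


options = {
    "mydatacenter": ["myrack1", "myrack2", "myrack4"],  # rack-list RF
    "otherdc": "2",                                     # numerical RF
}
print(replicas_per_dc(options))
```

Under this reading, adding a rack name to the list raises the RF for that datacenter by one, and omitting a rack lowers it by one.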
Ferenc Szili	20aeed1607	load balancing: extend locator::load_stats to collect tablet sizes This commit extends the TABLE_LOAD_STATS RPC with data about the tablet replica sizes and effective disk capacity. The effective disk capacity of a node is computed as the sum of the sizes of all tablet replicas on the node plus the available disk space. This is the first change in the size-based load balancing series. Closes scylladb/scylladb#26035	2025-10-03 13:37:22 +02:00
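The effective disk capacity formula from the commit message above can be written out as a short worked example. The function and parameter names are hypothetical, and the byte values are made up for illustration.

```python
# Hypothetical sketch of the formula in the commit message:
# effective capacity = sum of tablet replica sizes on the node + available space.
from typing import Iterable


def effective_disk_capacity(tablet_replica_sizes: Iterable[int],
                            available_disk_space: int) -> int:
    """Effective capacity of one node, in the same unit as the inputs."""
    return sum(tablet_replica_sizes) + available_disk_space


# Illustrative values in bytes: three tablet replicas and 40 GiB free.
GiB = 1024 ** 3
sizes = [10 * GiB, 20 * GiB, 30 * GiB]
print(effective_disk_capacity(sizes, 40 * GiB) // GiB)  # 100
```

Space occupied by anything other than tablet replicas is excluded from this figure, since it counts neither as a replica size nor as available space.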
Tomasz Grabiec	6962464be7	locator: Make hasher for endpoint_dc_rack globally accessible	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	6b7b0cb628	locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	6de342ed3e	locator: network_topology_strategy: Respect rack list when reallocating tablets	2025-10-02 19:42:39 +02:00

1105 Commits