For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly.
We can drop the voter status changes from the storage_service module because the topology coordinator will handle the votership changes eventually. The calls in the storage_service module were not essential and were only used for optimization (improving the HA under certain conditions).
Furthermore, another bundled commit makes the topology coordinator react to the node `on_up()` and `on_down()` events, which shortens the reaction time and further improves the HA.
The change affects the timing in the tablets migration test though, as the test previously relied on the node being made a non-voter by the storage_service `raft_removenode()` function. The fix is to add another server to the topology to make sure we keep the quorum.
Previously the test worked because it waits for an injection to be reached, and it was ensured that the injection (log line) was only triggered after the node had been made a non-voter by `raft_removenode()`. This is no longer the case. An alternative fix would be to wait for the first node to be made a non-voter before stopping the second server, but this would make the test more complex (and using only 4 servers in the test is not strictly required; it was only done for optimization purposes).
Fixes: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
No backport: part of the new limited voters feature, so this shouldn't be backported.
Closes scylladb/scylladb#22847
* https://github.com/scylladb/scylladb:
raft: use direct return of future for `run_op_with_retry`
raft: adjust the voters interface to allow atomic changes
raft topology: drop removing the node from raft config via storage_service
raft topology: drop changing the raft voters config via storage_service
Clean up the code by using direct return of the future for `run_op_with_retry`.
This can be done because `run_op_with_retry` already returns a future that we
can reuse directly. What needs to be taken care of is not to reference
temporaries from inside the lambda passed to `run_op_with_retry`.
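The pattern can be sketched outside Seastar with plain C++. The retry helper below is an illustrative stand-in, not the actual `run_op_with_retry` signature (the real one deals in futures): the helper already produces the result, so the caller hands it straight back instead of wrapping it again, and the callable must not reference temporaries that die before a retry.

```cpp
#include <functional>
#include <stdexcept>

// Illustrative stand-in for run_op_with_retry: runs `op` until it succeeds
// or the attempt budget is exhausted, and hands its result straight back.
int run_op_with_retry(const std::function<int()>& op, int max_attempts) {
    for (int attempt = 1;; ++attempt) {
        try {
            return op(); // reuse the operation's result directly, no re-wrapping
        } catch (const std::exception&) {
            if (attempt >= max_attempts) {
                throw; // budget exhausted, propagate the last error
            }
        }
    }
}

// The caller can simply `return run_op_with_retry(...)`. Note the lambda
// captures `fail_times` by reference, so it must outlive all retries.
int succeed_on_third_attempt() {
    int fail_times = 2;
    return run_op_with_retry([&fail_times] {
        if (fail_times-- > 0) {
            throw std::runtime_error("transient failure");
        }
        return 42;
    }, 5);
}
```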
Allow setting the voters and non-voters in a single operation. This
ensures that the configuration changes are done atomically.
In particular, we don't want to set voters and non-voters separately
because it could lead to inconsistencies or even the loss of quorum.
This change also partially reverts the commit 115005d, as we will only
need the convenience wrappers for removing the voters (not for adding
them).
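A minimal sketch of the idea (types and names are illustrative, not the actual raft interface): the additions and the removals are both applied before the new voter set becomes visible, so no intermediate configuration with a reduced quorum is ever observable.

```cpp
#include <set>
#include <string>
#include <utility>

// Illustrative sketch: one call applies promotions and demotions together,
// so observers never see a half-updated voter set.
struct voter_config {
    std::set<std::string> voters;

    void modify_voters(const std::set<std::string>& to_add,
                       const std::set<std::string>& to_remove) {
        std::set<std::string> next = voters; // build the full target set first
        next.insert(to_add.begin(), to_add.end());
        for (const auto& node : to_remove) {
            next.erase(node);
        }
        voters = std::move(next); // publish the change as a single step
    }
};
```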
Refs: scylladb/scylladb#18793
For the limited voters feature to work properly we need to make sure
that we are only managing the voter status through the topology
coordinator. This means that we should not change the node votership
from the storage_service module for the raft topology directly.
This needs to be done in addition to dropping the votership changes
from the storage_service module.
The `remove_from_raft_config` is redundant and can be removed because
a successfully completed `removenode` operation implies that the node
has been removed from group 0 by the topology coordinator.
Refs: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
For the limited voters feature to work properly we need to make sure
that we are only managing the voter status through the topology
coordinator. This means that we should not change the node votership
from the storage_service module for the raft topology directly.
We can drop the voter status changes from the storage_service module
because the topology coordinator will handle the votership changes
eventually. The calls in the storage_service module were not essential
and were only used for optimization (improving the HA under certain
conditions).
This affects the timing in the tablets migration test though, as it
relied on the node being made a non-voter by the storage_service
`raft_removenode()` function. The fix is to add another server to the
topology to make sure we keep the quorum.
Previously the test worked because it waits for an injection to be
reached, and it was ensured that the injection (log line) was only
triggered after the node had been made a non-voter by
`raft_removenode()`. This is no longer the case. An alternative fix
would be to wait for the first node to be made a non-voter before
stopping the second server, but this would make the test more complex
(and using only 4 servers in the test is not strictly required; it was
only done for optimization purposes).
Fixes: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
If hosts and/or dcs filters are specified for tablet repair and
some replicas match these filters, choose the replica that will
be the repair master according to round-robin principle
(currently it's always the first replica).
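The round-robin choice can be sketched as follows (the sequence counter is an assumption for illustration; the actual selection state lives elsewhere in the scheduler):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch: rotate the repair-master role across the replicas
// that passed the hosts/dcs filters, instead of always using the first one.
// `repair_seq` stands in for some per-tablet repair counter (an assumption).
const std::string& pick_repair_master(const std::vector<std::string>& filtered_replicas,
                                      std::size_t repair_seq) {
    return filtered_replicas[repair_seq % filtered_replicas.size()];
}
```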
If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, the repair succeeds and
the repair request is removed (currently an exception is thrown
and tablet repair scheduler reschedules the repair forever).
Fixes: https://github.com/scylladb/scylladb/issues/23100.
Needs backport to 2025.1, which introduces hosts and dcs filters for tablet repair
Closes scylladb/scylladb#23101
* github.com:scylladb/scylladb:
test: add new cases to tablet_repair tests
test: extract repair check to function
locator: add round-robin selection of filtered replicas
locator: add tablet_task_info::selected_by_filters
service: finish repair successfully if no matching replica found
If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, an exception is thrown. The repair
fails and tablet repair scheduler reschedules it forever.
Such a repair should actually succeed (as all specified replicas were
repaired) and the repair request should be removed.
Treat the repair as successful if the filters were specified and
selected no replica.
Replace value-based exception catching with reference-based catching to address
GCC warnings about polymorphic type slicing:
```
warning: catching polymorphic type ‘class seastar::rpc::stream_closed’ by value [-Wcatch-value=]
```
When catching polymorphic exceptions by value, the C++ runtime copies the
thrown exception into a new instance of the specified type, slicing the
actual exception and potentially losing important information. This change
ensures all polymorphic exceptions are caught by reference to preserve the
complete exception state.
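The difference can be shown with a small self-contained example (the types below are illustrative, not the actual `seastar::rpc` hierarchy): catching by value copy-constructs only the base subobject, so virtual dispatch sees the base type.

```cpp
#include <exception>
#include <string>

// Illustrative hierarchy standing in for seastar::rpc's exception types.
struct stream_error : std::exception {
    virtual std::string kind() const { return "generic stream error"; }
};
struct stream_closed : stream_error {
    std::string kind() const override { return "stream closed"; }
};

// Catching by value slices the thrown stream_closed down to stream_error
// (GCC flags this with -Wcatch-value=).
std::string catch_by_value() {
    try { throw stream_closed{}; }
    catch (stream_error e) { return e.kind(); }        // sliced copy
}

// Catching by reference preserves the dynamic type of the exception.
std::string catch_by_reference() {
    try { throw stream_closed{}; }
    catch (const stream_error& e) { return e.kind(); } // no copy, no slicing
}
```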
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23064
This series achieves two things:
1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes https://github.com/scylladb/scylladb/issues/21967
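The difference-by-one point above is simple arithmetic; a quick illustrative sketch (not Scylla code):

```cpp
// If shard tablet counts differ by one, the worst-case relative load
// imbalance between the heaviest and lightest shard is (n + 1) / n for
// n tablets per shard on the lightest shard.
double worst_case_imbalance(int tablets_per_shard) {
    return static_cast<double>(tablets_per_shard + 1) / tablets_per_shard;
}
// With 1 tablet per shard, the heaviest shard carries 2x the load of the
// lightest (2 vs 1 tablets); with 10 per shard it is only 1.1x (11 vs 10).
```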
2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count
The per-shard goal is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
The scaling is applied after computing desired tablet count due to
all other factors: per-table tablet count hints, defaults, average tablet size.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
When creating a new table, its tablet count is determined by tablet
scheduler using the scheduler logic, as if the table was already created.
So any scaling due to per-shard tablet count goal is reflected immediately
when creating a table. It may however still take some time for the system
to shrink existing tables. We don't reject requests to create new tables.
Fixes #21458
Closes scylladb/scylladb#22522
* github.com:scylladb/scylladb:
config, tablets: Allow tablets_initial_scale_factor to be a fraction
test: tablets_test: Test scaling when creating lots of tables
test: tablets_test: Test tablet count changes on per-table option and config changes
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
config: Make tablets_initial_scale_factor live-updateable
tablets: load_balancer: Pick initial_scale_factor from config
tablets, load_balancer: Fix and improve logging of resize decisions
tablets, load_balancer: Log reason for target tablet count
tablets: load_balancer: Move hints processing to tablet scheduler
tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
tablets: load_balancer: Determine desired count from size separately from count from options
tablets: load_balancer: Determine resize decision from target tablet count
tablets: load_balancer: Allow splits even if table stats not available
tablets: load_balancer: Extract make_sizing_plan()
tablets: Add formatter for resize_decision::way_type
tablets: load_balancer: Simplify resize_urgency_cmp()
tablets: load_balancer: Keep config items as instance members
locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
tablets: Set default initial tablet count scale to 10
tablets: network_topology_strategy: Coroutinize calculate_initial_tablets_from_topology()
tablets: load_balancer: Extract get_schema_and_rs()
tablets: load_balancer: Drop test_mode
The test simulates the cluster getting stuck during upgrade to raft
topology due to majority loss, and then verifies that it's possible to
get out of the situation by performing recovery and redoing the upgrade.
Fixes: #17410
Closes scylladb/scylladb#17675
* https://github.com/scylladb/scylladb:
test/topology_experimental_raft: add test_topology_upgrade_stuck
test.py: bump minimum python version to 3.11
test.py: move gather_safely to pylib utils
cdc: generation: don't capture token metadata when retrying update
test.py: topology: ignore hosts when waiting for group0 consistency
raft: add error injection that drops append_entries
topology_coordinator: add injection which makes upgrade get stuck
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and the raft_initialize_discovery_leader
call. Because of that, the group0 snapshot/apply entry picks up wrong
values from disk (null) and updates the in-memory variables to wrong
values. During this time gap, the in-memory variables hold wrong values
and the node performs incorrect actions based on them.
This PR removes the variable `_manage_topology_change_kind_from_group0`,
which was used earlier as a workaround for correctly handling the
`topology_change_kind` variable. It was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
is that `_manage_topology_change_kind_from_group0` used to block reading
from disk and, in the restart case, was only enabled after group0
initialization and the raft server start. In general, it was hard to
manage `topology_change_kind` via `_manage_topology_change_kind_from_group0`
in a bug-free manner.
After the removal of `_manage_topology_change_kind_from_group0`, careful
management of the `topology_change_kind` variable is needed to maintain
its correct value in all scenarios. So this PR also performs a
refactoring to populate all init data to the system tables even before
group0 creation (via the `raft_initialize_discovery_leader` function).
Because `raft_initialize_discovery_leader` now happens before the group0
creation, we write mutations directly to the system tables instead of
via a group0 command. Hence, after group0 creation, the node can read
the correct values from the system tables, and the correct values are
maintained throughout.
Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.
By moving the `raft_initialize_discovery_leader` logic to happen before
starting the group0 server, and not running it as a group0 command after
the server start, we also get rid of the potential problem of the init
group0 command not being the first command on the server, hence ensuring
full integrity as expected by the programmer.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#21114
Closes scylladb/scylladb#22484
* https://github.com/scylladb/scylladb:
storage_service: Remove the variable _manage_topology_change_kind_from_group0
storage_service: fix indentation after the previous commit
raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
service/raft: Refactor mutation writing helper functions.
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is to detect that the table
was dropped and skip to the next candidate, if any.
Fixes #21859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#22933
Resize is no longer only due to average tablet size. Log the average
tablet size as information, not as the reason, and log the true reason
for the target tablet count.
Hints have common meaning for all strategies, so the logic
belongs more to make_sizing_plan().
As a side effect, we can reuse the shard capacity computation across
tables, which reduces computational complexity from O(tables * nodes) to
O(tables * DCs + nodes).
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls the tablet allocator behavior when choosing the
initial tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.
We want to avoid over-allocation of tablets when a table is created,
which would then be reduced by the scheduler's scaling logic. This is
not just to avoid wasteful migrations after table creation, but also to
respect the per-shard goal. To respect the per-shard goal, the algorithm
will no longer be as simple as looking at hints, and we want to share
the algorithm between the scheduler and the initial tablet allocator.
So we invoke the scheduler to get the tablet count when a table is created.
This is in preparation for using the sizing plan during table creation
where we never have size stats, and hints are the only determining
factor for target tablet count.
Resize plan making will now happen in two stages:
1) Determine desired tablet counts per table (sizing plan)
2) Schedule resize decisions
We need an intermediate step in the resize plan making, which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.
We want decisions made by the scheduler to be consistent with decisions
made on table creation. We want to avoid over-allocation of tablets
when a table is created, which would then be reduced by the scheduler.
This is not just to avoid wasteful migrations after table creation, but
also to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and the initial tablet
allocator.
Also, this sizing plan will be later plugged into a virtual table for
observability.
Logic is preserved since target tablet size is constant for all
tables.
Dropping d.target_max_tablet_size() will allow us to move it
to the load_balancer scope.
Currently, we can not have more than one global topology operation at the same time. This means that we can not have concurrent truncate operations because truncate is implemented as a global topology operation.
Truncate is mutually exclusive with other topology operations, and has to wait for those to complete before it starts executing. This can lead to truncate timeouts. In these cases the client retries the truncate operation, which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking whether the ongoing global topology operation is a truncate running for the same table whose truncation has just been requested again. In this case, we can wait for the ongoing truncate to complete instead of immediately failing the operation, and
provide a better user experience.
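The retry-handling decision can be sketched like this (the types and names are assumptions for illustration, not the actual storage_proxy API):

```cpp
#include <optional>
#include <string>

enum class truncate_action { wait_for_ongoing, fail_busy, start_new };

// Illustrative sketch: a retried truncate of the same table waits for the
// ongoing one instead of failing with "Another global topology request is
// ongoing, please retry."
truncate_action on_truncate_request(const std::optional<std::string>& ongoing_truncate_of,
                                    bool other_global_request_ongoing,
                                    const std::string& table) {
    if (ongoing_truncate_of == table) {
        return truncate_action::wait_for_ongoing; // same table: piggyback on it
    }
    if (other_global_request_ongoing || ongoing_truncate_of.has_value()) {
        return truncate_action::fail_busy; // some other global request is running
    }
    return truncate_action::start_new;
}
```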
This is an improvement, backport is not needed.
Closes #22166
Closes scylladb/scylladb#22371
* github.com:scylladb/scylladb:
test: add test for re-cycling ongoing truncate operations
truncate: add additional logging and improve error message during truncate
storage_proxy: wait on already running truncate for the same table
storage_proxy: allow multiple truncate table fibers per shard
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have been started by a
concurrent transaction.
Fixes#22792
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#22793
Serializing `raft::append_request` for transmission requires approximately the same amount of memory as its size. This means when the Raft library replicates a log item to M servers, the log item is effectively copied M times. To prevent excessive memory usage and potential out-of-memory issues, we limit the total memory consumption of in-flight `raft::append_request` messages.
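The bookkeeping can be sketched with a simple byte-budget counter (a synchronous stand-in with assumed names; the real implementation would make senders wait asynchronously rather than reject them):

```cpp
#include <cstddef>

// Illustrative sketch: account the serialized size of each in-flight
// append_request against a fixed budget; a send that would exceed the
// budget is deferred (here: rejected) until earlier sends complete.
class append_memory_limiter {
    std::size_t _budget;
    std::size_t _in_flight = 0;
public:
    explicit append_memory_limiter(std::size_t budget) : _budget(budget) {}

    bool try_reserve(std::size_t bytes) {
        if (_in_flight + bytes > _budget) {
            return false; // caller should wait for release() before sending
        }
        _in_flight += bytes;
        return true;
    }
    void release(std::size_t bytes) { _in_flight -= bytes; }
    std::size_t in_flight() const { return _in_flight; }
};
```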
Fixes scylladb/scylladb#14411
Closes scylladb/scylladb#22835
* github.com:scylladb/scylladb:
raft_rpc::send_append_entries: limit memory usage
fsm: extract entry_size to log_entry::get_size
It will be needed for a test that simulates the cluster getting stuck
during upgrade. Specifically, it will be used to simulate network
isolation and to prevent raft commands from reaching that node.
The injection will be necessary for the test, introduced in the next
commit, which verifies that it's possible to recover from an upgrade of
raft topology that gets stuck.
This commit removes the variable _manage_topology_change_kind_from_group0,
which was used earlier as a workaround for correctly handling the
topology_change_kind variable; it was brittle and had some bugs. Earlier
commits made some modifications to handle the topology_change_kind variable
after the _manage_topology_change_kind_from_group0 removal.
This change adds two log messages. One for the creation of the truncate
global topology request, and another for the truncate timeout. This is
added in order to help with tracking truncate operation events.
It also extends the "Another global topology request is ongoing, please
retry." error message with more information: keyspace and table name.
Currently, we can not have more than one global topology operation at
the same time. This means that we can not have concurrent truncate
operations because truncate is implemented as a global topology
operation.
Truncate is mutually exclusive with other topology operations, and has to wait for
those to complete before truncate starts executing. This can lead to
truncate timeouts. In these cases the client retries the truncate operation,
which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking if we have a truncate for the same
table already queued. In this case, we can wait for the ongoing truncate to
complete instead of immediately failing the operation, and provide a better
user experience.
Demote do-nothing decisions to debug level, but keep them at info level
if we did decide to do something (such as migrate a tablet). Information
about more major events (like split/merge) is kept at info level.
One log line that logs node information now also logs the datacenter,
which was previously supplied by a log line that is now debug-only.
Closes scylladb/scylladb#22783
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. Repairing all replicas should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support.
Fixes https://github.com/scylladb/scylladb/issues/22417
New feature. No backport is needed.
Closes scylladb/scylladb#22621
* github.com:scylladb/scylladb:
test: add test to check dcs and hosts repair filter
test: add repair dc selection to test_tablet_metadata_persistence
repair: Introduce Host and DC filter support
docs: locator: update the docs and formatter of tablet_task_info
This commit eliminates unused Boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost libraries, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22857
In a rolling upgrade, nodes that weren't upgraded yet will not recognize
the new tablet_resize_finalization state, which serves both splits and
merges, leading to a crash. To fix that, the coordinator will pick the
old tablet_split_finalization state for serving split finalization
until the cluster agrees on merge, so it can then start using the new
generic state for resize finalization introduced in the merge series.
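The gating amounts to a simple cluster-feature check (names are illustrative, not the actual coordinator code):

```cpp
#include <string>

// Illustrative sketch: until every node in the cluster has agreed on the
// merge feature, the coordinator keeps emitting the legacy split state so
// not-yet-upgraded nodes can still parse the transition.
std::string finalization_state(bool cluster_agrees_on_merge) {
    return cluster_agrees_on_merge ? "tablet_resize_finalization"
                                   : "tablet_split_finalization";
}
```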
Regression was introduced in e00798f.
Fixes #22840.
Reported-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#22845
In order to allow concurrent truncate table operations (for the time being,
only for a single table) we have to remove the limitation allowing only one
truncate table fiber per shard.
This change collects the active truncate fibers in
storage_proxy::remote into a std::list<> instead of having just a single
truncate fiber. These fibers are waited on during
storage_proxy::remote::stop().
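The shape of the change can be sketched with deferred callables standing in for fibers (names are illustrative, not the actual storage_proxy types):

```cpp
#include <cstddef>
#include <functional>
#include <list>

// Illustrative sketch: active truncate "fibers" (modeled here as deferred
// callables) are collected in a list rather than a single slot, and stop()
// drains them all, as storage_proxy::remote::stop() would.
struct remote_sketch {
    std::list<std::function<void()>> truncate_fibers;

    void start_truncate(std::function<void()> fiber) {
        truncate_fibers.push_back(std::move(fiber));
    }

    std::size_t stop() {
        std::size_t drained = 0;
        for (auto& fiber : truncate_fibers) {
            fiber(); // "wait" for the fiber to complete
            ++drained;
        }
        truncate_fibers.clear();
        return drained;
    }
};
```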
Previously, the topology_change_kind variable was handled using the
_manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs (e.g. in the restart case, it led to a time gap between
the group0 server start and topology_change_kind being managed via group0).
After the _manage_topology_change_kind_from_group0 removal, careful
management of the topology_change_kind variable was needed to maintain
its correct value in all scenarios. So this PR also performs a refactoring
to populate all init data to the system tables even before group0 creation
(via the raft_initialize_discovery_leader function). Because
raft_initialize_discovery_leader now happens before the group0 creation,
we write mutations directly to the system tables instead of via a group0
command. Hence, after group0 creation, the node can read the correct
values from the system tables, and the correct values are maintained
throughout.
Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.
By moving the raft_initialize_discovery_leader logic to happen before
starting the group0 server, and not running it as a group0 command after
the server start, we also get rid of the potential problem of the init
group0 command not being the first command on the server, hence ensuring
full integrity as expected by the programmer.
Fixes: scylladb/scylladb#21114