scylladb

Author	SHA1	Message	Date
Piotr Smaron	d4c28690e1	db: fail reads and writes with local consistencty level to a DC with RF=0 When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception. Scylla allowed such read operations and failed write operations with a cryptic: "broken promise" error. This occured because the initial availability check passed (quorum of 0 requires 0 replicas), but execution failed later when no replicas existed to process the mutation. This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that throws before attempting operation execution. The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be upgraded. This testcase was asserting somewhat similar scenario, but wasn't taking into account the whole matrix of combinations: - scenarios: successful vs unsuccesful operation outcome - local consistency levels: LOCAL_QUORUM & LOCAL_ONE - operations: SELECT (read) & INSERT (write) and so it's been extended to cover both the pre-existing and the current issues and the whole matrix of combinations. Fixes: scylladb/scylladb#27893	2026-01-22 12:49:45 +01:00
Piotr Smaron	9475659ae8	db: consistency_level: split `local_quorum_for()` The core of `local_quorum_for()` has been extracted to `get_replication_factor_for_dc()`, which is going to be used later, while `local_quorum_for()` itself has been recreated using the exracted part.	2026-01-22 12:49:23 +01:00
Piotr Smaron	0b3ee197b6	db: consistency_level: fix nrs -> nts abbreviation `network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I think someone incorrectly assumed it's 'network Replication strategy', hence nrs.	2026-01-22 12:48:37 +01:00
Avi Kivity	bd08b6e5b2	Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with. About correctness of the changes. No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111 to show that they are valid and pass before changing the core code. Enhancing the way configuration is made, likely no need to backport. Closes scylladb/scylladb#28112 * github.com:scylladb/scylladb: test: Validate S3 endpoints new format works docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper test: Rename badconf variable into objconf test: Split the object_store/test_get_object_store_endpoints test	2026-01-14 18:29:03 +02:00
Gleb Natapov	bee5f63cb6	topology coordinator: complete pending operation for a replaced node A replaced node may have pending operation on it. The replace operation will move the node into the 'left' state and the request will never be completed. More over the code does not expect left node to have a request. It will try to process the request and will crash because the node for the request will not be found. The patch checks is the replaced node has peening request and completes it with failure. It also changes topology loading code to skip requests for nodes that are in a left state. This is not strictly needed, but makes the code more robust. Fixes #27990 Closes scylladb/scylladb#28009	2026-01-14 13:11:27 +01:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Pavel Emelyanov	f227de24b2	object_storage: Create s3 client with "extended" endpoint name For this, add the s3::client::make(endpoint, ...) overload that accepts endpoint in proto://host:port format. Then it parses the provided url and calls the legacy one, that accepts raw host string and config with port, https bit, etc. The generic object_storage_endpoint_param no longer needs to carry the internal s3::endpoint_config, the config option parsing changes respectively. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Avi Kivity	66aee0fb5e	alternator: add optional listeners for proxy protocol v2 Following `954f2cbd2f`, which added proxy protocol v2 listeners for CQL, we do the same for alternator. We add two optional ports for plain and TLS-wrapped HTTP. We test each new port, that the old ports still work, and that mixing up a port with no proxy protocol and a connection with proxy protocol (or the opposite) fails. The latter serves to show that the testing strategy is valid and doesn't just pass whatever happens. We also verify that the correct addresses (and TLS mode) show up in system.clients. Closes scylladb/scylladb#27889	2026-01-13 09:59:24 +02:00
Botond Dénes	04b8f72946	Merge 'repair: Implement auto repair for tablet repair' from Asias He repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrcis are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that needs repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99 New feature. No backport. Closes scylladb/scylladb#27534 * github.com:scylladb/scylladb: topology_coordinator: Add metrics for tablet repair repair: Implement auto repair for tablet repair	2026-01-12 14:16:01 +02:00
Petr Gusev	889d7782ed	treewide: use coroutine::maybe_yield in coroutines It's more efficient since coroutine::maybe_yield returns a lightweight struct (awaitable), not the future. Closes scylladb/scylladb#28101	2026-01-12 10:38:47 +01:00
Alex	e430065c92	db: views: serialize create/drop view operations via shard 0 Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build` A problematic sequence looks like this: * `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created * `on_drop_view()` runs on shard 0 → entry for shard 0 is removed * `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again * `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI. This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating. new process: - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background. - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That: - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...). - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...). - Each shard runs handle_create_view_local(...), which: - waits for pending base writes/streams, flushes the base, - resets the reader to the current token and adds the new view, - handles errors and triggers _build_step to continue processing. Drop view - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) in the background. - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...). - Each shard runs handle_drop_view_local(...), which: - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps, - ignores missing keyspace cases. - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which: - removes global build progress, built‑view state, and view build status in system tables, Shutdown - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted. In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness. Fixes: https://github.com/scylladb/scylladb/issues/27898 Backport: not required as the problem happens on master Closes scylladb/scylladb#27929	2026-01-12 09:23:22 +02:00
Calle Wilund	a7cdb602e1	db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc Fixes #27992 When doing a commit log oversized allocation, we lock out all other writers by grabbing the _request_controller semaphore fully (max capacity). We thereafter assert that the semaphore is in fact zero. However, due to how things work with the bookkeep here, the semaphore can in fact become negative (some paths will not actually wait for the semaphore, because this could deadlock). Thus, if, after we grab the semaphore and execution actually returns to us (task schedule), new_buffer via segment::allocate is called (due to a non-fully-full segment), we might in fact grab the segment overhead from zero, resulting in a negative semaphore. The same problem applies later when we try to sanity check the return of our permits. Fix is trivial, just accept less-than-zero values, and take same possible ltz-value into account in exit check (returning units) Added whitebox (special callback interface for sync) unit test that provokes/creates the race condition explicitly (and reliably). Closes scylladb/scylladb#27998	2026-01-09 14:06:58 +02:00
Asias He	7ba7b25bdd	repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrcis are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that needs repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99	2026-01-09 16:11:39 +08:00
Botond Dénes	60570d7114	Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations: * Altering a keyspace's RF if it would make the keyspace RF-rack-invalid * Adding a node in a new rack * Removing / Decommissioning the last node in a rack Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views. The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid. Fixes scylladb/scylladb#23345 Fixes https://github.com/scylladb/scylladb/issues/26820 backport to relevant versions for materialized views with tablets since it depends on rf-rack validity Closes scylladb/scylladb#26354 * github.com:scylladb/scylladb: docs: update RF-rack restrictions cql3: don't apply RF-rack restrictions on vector indexes cql3: add warning when creating mv/index with tablets about rf-rack service/tablet_allocator: always allow tablet merge of tables with views locator: extend rf-rack validation for rack lists test: test rf-rack validity when creating keyspace during node ops locator: fix rf-rack validation during node join/remove test: test topology restrictions for views with tablets test: add test_topology_ops_with_rf_rack_valid topology coordinator: restrict node join/remove to preserve RF-rack validity topology coordinator: add validation to node remove locator: extend rf-rack validation functions view: change validate_view_keyspace to allow MVs if RF=Racks db: enforce rf-rack-validity for keyspaces with views replica/db: add enforce_rf_rack_validity_for_keyspace helper db: remove enforce parameter from check_rf_rack_validity test: adjust test to not break rf-rack validity	2026-01-09 10:01:23 +02:00
Michael Litvak	8f15c7a874	db/view/view_update_generator: move discover_staging_sstables to start Call discover_staging_sstables in view_update_generator::start() instead of in the constructor, because the constructor is called during initialization before sstables are loaded. The initialization order was changed in `5d1f74b86a` and caused this regression. It means the view update generator won't discover staging sstables on startup and view updates won't be generated for them. It also causes issues in sstable cleanup. view_update_generator::start() is called in a later stage of the initialization, after sstable loading, so do the discovery of staging sstables there. Fixes scylladb/scylladb#27956 Closes scylladb/scylladb#27970	2026-01-08 21:55:19 +02:00
Asias He	4f77dd058d	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#27679	2026-01-08 21:55:18 +02:00
Nadav Har'El	5f79d93102	Merge 'Alternator response compression' from Szymon Malewski This pull request introduces HTTP response compression to Alternator, allowing responses (both string and chunked) to be compressed using `gzip` or `deflate` when requested by clients and when the response size exceeds configurable thresholds. * Added new source files `http_compression.cc` and `http_compression.hh` implementing compression logic, including parsing client `Accept-Encoding` headers, selecting compression algorithms, and compressing response bodies using zlib. * Added two new configuration options to `db::config` (`alternator_response_gzip_compression_level` and `alternator_response_gzip_compression_threshold_in_bytes`) to control compression level (and optionally disable compression with level 0 - no compression) and minimum response size for compression. * Added tests showing compliance with DynamoDB behavior. Fixes #27246 New feature - no backporting Closes scylladb/scylladb#27454 * github.com:scylladb/scylladb: alternator/http_compression: Add compression of streamed response alternator/http_compression: Add implementation od gzip/deflate of string response alternator/http_compression: Add handling of Accept-Encoding header test/alternator: add tests for compressed responses	2026-01-06 16:47:11 +02:00
Szymon Malewski	ec329f85b0	alternator/http_compression: Add handling of Accept-Encoding header This is an initial patch to add support of Alternator's compressed responses. The actual compression (gzip,deflate) will be added in the following commits. The main functionality added in this commmit is parsing of Accept-Encoding header, that indicates compression algorithms supported by the client. In this commit we add also configuration parameters of response gzip/deflate compression. They allow to enable/disable compression, set level and a size threshold below which a response is not compressed. With current implementation it is possible to decide a compression for each response, but it is not used yet.	2026-01-05 10:14:40 +01:00
Avi Kivity	0df85c8ae8	Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov" This reverts commit `1bb897c7ca`, reversing changes made to `954f2cbd2f`. It makes incompatible changes to the object storage configuration format, breaking tests [1]. It's likely that it doesn't break any production configuration, but we can't be sure. Fixes #27966 Closes scylladb/scylladb#27969	2026-01-05 08:53:41 +02:00
Avi Kivity	e03d24e3f3	Merge 'Use file_stat with a relative path when listing directories' from Benny Halevy With the additional file_stat overload introduced in [Update seastar submodule](`3e9b071838`), use the opened directory for more efficient, relative-path based stat. * Enhancement, no backport needed Closes scylladb/scylladb#27967 * github.com:scylladb/scylladb: table: get_snapshot_details: use relative-path based file_stat table: get_snapshot_details: fix warning in exists_in_dir table: get_snapshot_details: fix staging dir calculation backup: process_snapshot_dir: use relative-path based file_stat directory_lister: add ctor with opened directory	2026-01-04 22:06:34 +02:00
Dawid Mędrek	77a934e5b9	db/hints: Prevent draining hints before hint replay is allowed Context ------- The procedure of hint draining boils down to the following steps: 1. Drain a hint sender. That should get rid of all hints stored for the corresponding endpoint. 2. Remove the hint directory corresponding to that endpoint. Obviously, it gets more complex than this high-level perspective. Without blurring the view, the relevant information is that step 1 in the algorithm above may not be executed. Breaking it down, it comprises of two calls to `hint_sender::send_hints_maybe()`. The function is responsible for sending out hints, but it's not unconditional and will not be performed if any of the following bullets is not satisfied: * `hint_sender::replay_allowed()` is not `true`. This can happen when hint replay hasn't been turned on yet. * `hint_sender::can_send()` is not `true`. This can happen if the corresponding endpoint is not alive AND it hasn't left the cluster AND it's still a normal token owner. There is one more relevant point: sending hints can be stopped if replaying hints fails and `hint_sender::send_hints_maybe()` returns `false`. However, that's not not possible in the case of draining. In that case, if Scylla comes across any failure, it'll simply delete the corresponding hint segment. Because of that, we ignore it and only focus on the two bullets. --- Why is it a problem? -------------------- If a hint directory is not purged of all hint segments in it, any attempt to remove it will fail and we'll observe an error like this: ``` Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [<path>]) ``` The folder with the remaining hints will also stay on disk, which is, of course, undesired. --- When can it happen? ------------------- As highlighted in the Context section of this commit message, the key part of the code that can lead to a dangerous situation like that is `hint_sender::send_hints_maybe()`. The function is called twice when draining a hint endpoint manager: once to purge all of the existing hints, and another time after flushing all hints stored in a commitlog instances, but not listed by `hint_sender` yet. If any of those calls misbehaves, we may end up with a problem. That's why it's crucial to ensure that the function always goes through ALL of the hints. Dangerous situations: 1. We try to drain hints before hint replay is allowed. That will violate the first bullet above. 2. The node we're draining is dead, but it hasn't left the cluster, and it still possesses some tokens. --- How do we solve that? --------------------- Hint replay is turned on in `main.cc`. Once enabled, it cannot be disabled. So to address the first bullet above, it suffices to ensure that no draining occurs beforehand. It's perfectly fine to prevent it. Soon after hint replay is allowed, `main.cc` also asks the hint manager to drain all of the endpoint managers whose endpoints are no longer normal token owners (cf. `db::hints::manager::drain_left_nodes()`). The other bullet is more tricky. It's important here to know that draining only initiated in three situations: 1. As part of the call to `storage_service::notify_left()`. 2. As part of the call to `storage_service::notify_released()`. 3. As part of the call to `db::hints::manager::drain_left_nodes()`. The last one is trivially non-problematic. The nodes that it'll try to drain are no longer normal token owners, so `can_send()` must always return `true`. The second situation is similar. As we read in the commit message of scylladb/scylladb@eb92f50413, which introduced the notion of released nodes, the nodes are no longer normal token owners: > In this patch we postpone the hint draining for the "left" nodes to > the time when we know that the target nodes no longer hold ownership > of any tokens - so they're no longer referenced in topology. I'm > calling such nodes "released". I suggest reading the full commit message there because the problems there are somewhat similar these changes try to solve. Finally, the first situation: unfortunately, it's more tricky. The same commit message says: > When a node is being replaced, it enters a "left" state while still > owning tokens. Before this patch, this is also the time when we start > draining hints targeted to this node, so the hints may get sent before > the token ownership gets migrated to another replica, and these hints > may get lost. This suggests that `storage_service::notify_left()` may be called when the corresponding node still has some tokens! That's something that may prevent properly draining hints. Fortunately, no hope is lost. We only drain hints via `notify_left()` when hinted handoff hasn't been upgraded to being host-ID-based yet. If it has, draining always happens via `notify_released()`. When I write this commit message, all of the supported versions of Scylla 2025.1+ use host-ID-based hinted handoff. That means that problems can only arise when upgrading from an older version of Scylla (2024.1 downwards). Because of that, we don't cover it. It would most likely require more extensive changes. --- Non-issues ---------- There are notions that are closely related to sending hints. One of them is the host filter that hinted handoff uses. It decides which endpoints are eligible for receiving hints, and which are not. Fortunately, all endpoints rejected by the host filter lose their hint endpoint managers -- they're stopped as part of that procedure. What's more, draining hints and changing the host filter cannot be happening at the same time, so it cannot lead to any problems. The solution ------------ To solve the described issue, we simply prevent draining hints before hint replay is allowed. No reproducer test is attached because it's not feasible to write one. Fixes scylladb/scylladb#27693 Closes scylladb/scylladb#27713	2026-01-04 16:54:05 +02:00
Benny Halevy	1a08ef2062	backup: process_snapshot_dir: use relative-path based file_stat With the additional file_stat overload introduced in `3e9b071838`, use the opened directory for more efficient, relative-path based stat. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-04 11:05:56 +02:00
Avi Kivity	853f3dadda	Merge 'treewide: fix some spelling errors' from Piotr Smaron Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all. No need to backport, just cosmetic changes. Closes scylladb/scylladb#27897 * github.com:scylladb/scylladb: treewide: fix some spelling errors codespell: ignore `iif` and `tread`	2025-12-29 20:45:31 +02:00
Tomasz Grabiec	bbf9ce18ef	Merge 'load_balancer: compute node load based on tablet sizes' from Ferenc Szili Currently, the tablet load balancer performs capacity based balancing by collecting the gross disk capacity of the nodes, and computes balance assuming that all tablet sizes are the same. This change introduces size-based load balancing. The load balancer does not assume identical tablet sizes any more, and computes load based on actual tablet sizes. The size-based load balancer computes the difference between the most and least loaded nodes in the balancing set (nodes in DC, or nodes in a rack in case of `rf-rack-valid-keyspaces`) and stops further balancing if this difference is bellow the config option `size_based_balance_threshold_percentage`. This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node: `delta = (most_loaded - least_loaded) / most_loaded` If this delta is smaller then the config threshold, the balancer will consider the nodes balanced. This PR is a part of a series of PRs which are based on top of each other. - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 - The fifth part changes the load balancing simulator: #26438 This is a new feature, backport is not needed. Fixes #26254 Closes scylladb/scylladb#26254 * github.com:scylladb/scylladb: test, load balancing: add test for table balance load_balancer: add cluster feature for size based balancing load_balancer: implement size-based load balancing config: add size based load balancing config params load_stats: use trinfo to decide how to reconcile tablet size load_sketch: use tablet sizes in load computation load_stats: add get_tablet_size_in_transition()	2025-12-29 15:01:38 +01:00
Piotr Smaron	fb4d89f789	treewide: fix some spelling errors	2025-12-29 13:53:56 +01:00
Nadav Har'El	8df9cfcde8	Merge 'Add table size bytes to describe table' from Radosław Cybulski Add table size to DescribeTable's reply in Alternator Fills DescribeTable's reply with missing field TableSizeBytes. - add helper class simple_value_with_expiry, which is like std::optional but the value put has a timeout. - add ignore_errors to estimate_total_sstable_volume function - if set to true the function will catch errors during RPC and ignore them, substituting 0 for missing value. - add a reference to storage_service to executor class (needed to call estimate_total_sstable_volume function). - add fill_table_description and create_table_on_shard0 as non static methods to executor class - calculate TableSizeBytes value for a given table and return it as part of DescribeTable's return value. The value calculated is cached for approximately 6 hours (as per DescribeTable's specification). The algorithm is as follows: - if the requested value is in cache and is still valid it's returned, nothing else happens. - otherwise: - every shard of every node is requested to calculate size of its data - if the error happens, the error is ignored and we assume the given shard has a size of 0 - all such values are summed producing total size - produced value is returned to a caller - on the node the call for a size happened every shard is requested to cache produced value with a 6 hour timeout. - if the next call comes for a differet shard on the same node that doesn't yet have cached value, the shard will request the value to be calculated again. The new value will overwrite the old one on every shard on this node. - if the next call comes to a different node, the process of calculation will happen from start, possibly producing different value. The value will have it's own timeout, there's no attempt made to synchronize value between nodes. - add a alternator_describe_table_info_timeout_in_seconds parameter, which will control, how long DescribeTable's table information are being held in cache. Default is 6 hours. - update test to use parameter `alternator_describe_table_info_timeout_in_seconds` - setting it to 0 and forcing flushing memtables to disk allows checking, that table size has grown. Fixes #7551 Closes scylladb/scylladb#24634 * github.com:scylladb/scylladb: alternator: fix invalid rebase Update tests Update documentation Add table size to DescribeTable's output Promote fill_table_description and create_table_on_shard0 to methods Modify estimate_total_sstable_volume to opt ignore errors Add alternator_describe_table_info_cache_validity_in_seconds config option Add ref to service::storage_service to executor Add simple_value_with_expiry util class	2025-12-29 14:47:36 +02:00
Benny Halevy	f60033db63	db: system_keyspace: get_group0_history: unfreeze_gently Prevent stall when the group0 history is too long using unfreeze_gently rather than the synchronous unfreeze() function Fixes #27872 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27873	2025-12-29 12:00:54 +02:00
Radosław Cybulski	a532fc73bc	Add alternator_describe_table_info_cache_validity_in_seconds config option Add a `alternator_describe_table_info_cache_validity_in_seconds` configuration option with default value of 6 hours.	2025-12-29 08:33:05 +01:00
Pavel Emelyanov	2e33234e91	util: Remove lister::rmdir() There's seastar helper that does the same, no need to carry yet another implementation Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27851	2025-12-28 19:46:19 +02:00
Ferenc Szili	cc9e125f12	config: add size based load balancing config params This change adds: - The config paremeter force_capacity_based_balancing which, when enabled performs capacity based balancing instead of size based. - The config parameter size_based_balance_threshold_percentage which sets the balance threshold for the size based load balancer. - The config parameter minimal_tablet_size_for_balancing which sets the minimal tablet size for the load balancer.	2025-12-27 10:37:38 +01:00
Ferenc Szili	621cb19045	load_sketch: use tablet sizes in load computation This commit changes load_sketch so that it computes node and shard load based on tablet sizes instead of tablet count.	2025-12-27 10:37:23 +01:00
Botond Dénes	27bf65e77a	db/batchlog_manager: add missing <seastar/coroutine/parallel_for_each.hh> include Build only fails if `--disable-precompiled-header` is passed to `configure.py`. Not sure why. Closes scylladb/scylladb#27721	2025-12-24 16:32:12 +02:00
Botond Dénes	1bb897c7ca	Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised. Closes scylladb/scylladb#27360 * github.com:scylladb/scylladb: object_storage: Temporarily handle pure endpoint addresses as endpoints code: Remove dangling mentions of s3::endpoint_config docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name test: Add named constants for test_get_object_store_endpoints endpoint names s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper	2025-12-24 06:59:02 +02:00
Botond Dénes	954f2cbd2f	Merge 'config, transport: add listeners for native protocol fronted by proxy protocol v2' from Avi Kivity For deployments fronted by a reverse proxy (haproxy or privatelink), we want to use proxy protocol v2 so that client information in system.clients is correct and so that the shard-aware selection protocol, which depends on the source port, works correctly. Add proxy-protocol enabled variants of each of the existing native transport listeners. Tests are added to verify this works. I also manually tested with haproxy. New feature, no backport. Closes scylladb/scylladb#27522 * github.com:scylladb/scylladb: test: add proxy protocol tests config, transport: support proxy protocol v2 enhanced connections	2025-12-24 06:58:00 +02:00
Michał Hudobski	ce3320a3ff	auth: add system table permissions to VECTOR_SEARCH_INDEXING Due to the recent changes in the vector store service, the service needs to read two of the system tables to function correctly. This was not accounted for when the new permission was added. This patch fixes that by allowing these tables (group0_history and versions) to be read with the VECTOR_SEARCH_INDEXING permission. We also add a test that validates this behavior. Fixes: SCYLLADB-73 Closes scylladb/scylladb#27546	2025-12-23 15:53:07 +02:00
Avi Kivity	7586c5ccbd	Merge 'system.clients: add `client_options` map column' from Vladislav Zolotarov This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data. Caching and client metadata refactoring: * The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) per-connection. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1. So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now). * These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard. * Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once. * This will would reduce the "variable" (per-connection) memory usage by at least 50%. And in case of "ScyllaDB Python Driver" driver version - by 78%! * And all this for a price of a single `loading_shared_values` object per-shard (implements a hash table) and a minor overhead for each value stored in it. * Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48) * Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362) Virtual table and API enhancements: * Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879) API and interface changes: * Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. (`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48) Testing and validation: * Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90) Closes scylladb/scylladb#25746 * github.com:scylladb/scylladb: transport/server: declare a new "CLIENT_OPTIONS" option as supported service/client_state and alternator/server: use cached values for driver_name and driver_version fields system.clients: add a client_options column controller: update get_client_data to use foreign_ptr for client_data	2025-12-22 20:02:40 +02:00
Pavel Emelyanov	e304d912b4	Merge 'db/view/view_building_worker: follow-ups' from Michał Jadwiszczak This patch consists of a few smaller follow-ups to the view building worker: - catch general execption in staging task registrator - remove unnecessary CV broadcast - don't pollute function context with conditionally compiled variable - avoid creating a copy of tasks map - fix some typos Refs https://github.com/scylladb/scylladb/issues/25929 Refs https://github.com/scylladb/scylladb/pull/26897 This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts. Closes scylladb/scylladb#26558 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: fix typos db/view/view_building_worker: remove unnnecessary empty lines db/view/view_building_worker: fix typo db/view/view_building_worker: avoid creating a copy of tasks map db/view/view_building_worker: wrap conditionally compiled code in a scope db/view/view_building_worker: remove unnecessary CV broadcast db/view/view_building_worker: catch general execption in staging task registrator	2025-12-22 20:02:40 +02:00
Michael Litvak	8df61f6d99	view: change validate_view_keyspace to allow MVs if RF=Racks The function validate_view_keyspace checks if a keyspace is eligible for having materialized views, and it is used for validation when creating a MV or a MV-based index. Previously, it was required that the rf_rack_valid_keyspaces option is set in order for tablets-based keyspaces to be considered eligible, and the RF-rack condition was enforced when the option is set. Instead of this, we change the validation to allow MVs in a keyspace if the RF-rack condition is satisfied for the keyspace - regardless of the config option. We remove the config validation for views on startup that validates the option `rf_rack_valid_keyspaces` is set if there are any views with tablets, since this is not required anymore. We can do this without worrying about upgrades because this change will be effective from 2025.4 where MVs with tablets are first out of experimental phase. We update the test for MV and index restrictions in tablets keyspaces according to the new requirements. * Create MV/index: previously the test checked that it's allowed only if the config option `rf_rack_valid_keyspaces` is set. This is changed now so it's always allowed to create MV/index if the keyspace is RF-rack-valid. Update the test to verify that we can create MV/index when the keyspace is RF-rack-valid, even if the rf_rack option is not set, and verify that it fails when the keyspace is RF-rack-invalid. * Alter: Add a new test to verify that while a keyspace has views, it can't be altered to become RF-rack-invalid.	2025-12-22 09:14:29 +01:00
Vlad Zolotarov	28cbaef110	service/client_state and alternator/server: use cached values for driver_name and driver_version fields Optimize memory usage changing types of driver_name and driver_version be a reference to a cached value instead of an sstring. These fields very often have the same value among different connections hence it makes sense to cache these values and use references to them instead of duplicating such strings in each connection state. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-20 12:26:22 -05:00
Vlad Zolotarov	85adf6bdb1	system.clients: add a client_options column This new column is going to contain all OPTIONS sent in the STARTUP frame of the corresponding CQL session. The new column has a `frozen<map<text, text>>` type, and we are also optimizing the amount of required memory for storing corresponding keys and values by caching them on each shard level. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-20 12:26:15 -05:00
Vlad Zolotarov	3a54bab193	controller: update get_client_data to use foreign_ptr for client_data get_client_data() is used to assemble `client_data` objects from each connection on each CPU in the context of generation of the `system.clients` virtual table data. After collected, `client_data` objects were std::moved and arranged into a different structure to match the table's sorting requirements. This didn't allow having not-cross-shard-movable objects as fields in the `client_data`, e.g. lw_shared_ptr objects. Since we are planning to add such fields to `client_data` in following patches this patch is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>` objects instead of naked `client_data` ones. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-19 11:01:41 -05:00
Patryk Jędrzejczak	d5c205194b	Merge 'topology: Make removenode use left_token_ring state for global barrier' from Emil Maskovsky Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier. Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests. Key changes: - Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler - In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead) - Updated documentation to reflect that both operations use this state This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion. The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade. Fixes: scylladb/scylladb#25530 No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed. Closes scylladb/scylladb#26931 * https://github.com/scylladb/scylladb: topology: make removenode use left_token_ring state for global barrier topology: allow removing nodes not having tokens features: add feature flag for removenode via left token ring	2025-12-18 09:34:38 +01:00
Emil Maskovsky	9431826c52	topology: allow removing nodes not having tokens For the changes to go through the left_token_ring state when REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow removing nodes to not have any tokens (similarly to decommissioning nodes, which use the same sequence of states). This means the tests also need to change to allow for this new behavior - it can temporarily happen that a removing node has no tokens but is still part of Raft group 0 (so there may be a temporary mismatch between the token ring and group 0 membership). Therefore, the `check_token_ring_and_group0_consistency` function is replaced by `wait_for_token_ring_and_group0_consistency`, which waits up to 30 seconds for consistency to be reached.	2025-12-17 13:31:11 +01:00
Avi Kivity	1382b47d45	config, transport: support proxy protocol v2 enhanced connections We have four native transport ports: two for plain/TLS, and two more for shard-aware (plain/TLS as well). Add four more that expect the proxy protocol v2 header. This allows nodes behind a reverse proxy to record the correct source address and port in system.clients, and the shard-aware port to see the correct source port selection made my the client.	2025-12-17 14:18:04 +02:00
Pavel Emelyanov	a6618f225c	object_storage_endpoint_param: Make it formattable for real Currently the formatter converts it to json and then tries to emit into the output context with the "...{{}}" format string. The intent was to have the "...{<json text>}" output. However, the double curly brace in format string means "print a curly brace", so the output of the above formatting is "...{}", literally. Fix by keeping a single curly brace. The "<json text>" thing will have its own surrounding curly braces. Fixes #27718 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27687	2025-12-17 11:48:39 +01:00
Tomasz Grabiec	c077283352	Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk If a keyspace has a numeric replication factor in a DC and rf < #racks, then the replicas of tablets in this keyspace can be distributed among all racks in the DC (different for each tablet). With rack list, we need all tablet replicas to be placed on the same racks. Hence, the conversion requires tablet co-location. After this series, the conversion can be done using ALTER KEYSPACE statement. The statement that does this conversion in any DC is not allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks each and a keyspace ks then with a single ALTER KEYSPACE we can do: - {dc1 : 2} -> {dc1 : [r1, r2]}; - {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: [r2,r3]}; - {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2} - {dc1 : 2} -> {dc1 : 2, dc2 : [r1]} But we cannot do: - {dc1 : 2} -> {dc1 : [r1, r2, r3]}; - {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1]. In order to do the co-locations rf change request is paused. Tablet load balancer examines the paused rf change requests and schedules necessary tablet migrations. During the process of co-location, no other cross-rack migration is allowed. Load balancer checks whether any paused rf change request is ready to be resumed. If so, it puts the request back to global topology request queue. While an rf change request for a keyspace is running, any other rf change of this keyspace will fail. Fixes: #26398. New feature, no backport Closes scylladb/scylladb#27279 * github.com:scylladb/scylladb: test: add est_rack_list_conversion_with_two_replicas_in_rack test: test creating tablet_rack_list_colocation_plan test: add test_numeric_rf_to_rack_list_conversion test tasks: service: add global_topology_request_virtual_task cql3: statements: allow altering from numeric rf to rack list service: topology_coordinator: pause keyspace_rf_change request service: implement make_rack_list_colocation_plan service: add tablet_rack_list_colocation_plan cql3: reject concurrent alter of the same keyspace test: check paused rf change requests persistence db: service: add paused_rf_change_requests to system.topology service: pass topology and system_keyspace to load_balancer ctor service: tablet_allocator: extract load updates service: tablet_allocator: extract ensure_node tasks, system_keyspace: Introduce get_topology_request_entry_opt() node_ops: Drop get_pending_ids() node_ops: Drop redundant get_status_helper()	2025-12-17 10:05:06 +01:00
Tomasz Grabiec	7bc59e93b2	Fix lambda-coroutine fiasco in hint_endpoint_manager.cc Found by copilot. No issue was observed yet. Fixes #27520 Closes scylladb/scylladb#27477	2025-12-16 20:16:41 +03:00
Aleksandra Martyniuk	9039dfa4a5	tasks: service: add global_topology_request_virtual_task Add a service::topo::global_topology_request_virtual_task, which covers the replication factor changes. Currently, the global_topology_request_virtual_task can be aborted only if it is paused. The progress of the rf change isn't counted.	2025-12-16 13:31:22 +01:00
Aleksandra Martyniuk	08e5f35527	db: service: add paused_rf_change_requests to system.topology In the following changes, we allow to alter from numeric rf to rack list. Before the alter, two tablets of the same keyspace can have replicas on different racks. To switch to rack list, we need to co-locate the replicas. It will be achieved by pausing the keyspace_rf_change and scheduling migrations. We need to persist the ids of requests that are paused. A new column - paused_rf_change_requests is added to system.topology table. In this commit no data is kept in the new column.	2025-12-16 13:25:38 +01:00
Tomasz Grabiec	71e6ef90f4	tasks, system_keyspace: Introduce get_topology_request_entry_opt() It's a cleanup. Better to return std::nullopt than faking an entry with an id when require_entry == false.	2025-12-16 13:25:34 +01:00

1 2 3 4 5 ...

4714 Commits