scylladb

Author	SHA1	Message	Date
Karol Nowacki	22a133df9b	service/vector_store_client: Add live configuration update support Enable runtime updates of vector_store_uri configuration without requiring server restart. This allows to dynamically enable, disable, or switch the vector search node endpoint on the fly.	2025-08-12 08:12:53 +02:00
Nadav Har'El	a896e2dbb9	alternator: add optional support for writing to system table In commit `44a1daf` we added the ability to read system tables through the DynamoDB API (actually, the Scan and Query requests only). This ability is useful for tests, and can also be useful to users who want to read information that is only available through system tables. This patch adds support also for writing into system tables. This will be useful for Alternator tests, were we want to temporarily change some live-updatable configuration option - and so far haven't been able to do that like we did do in some cql-pytest tests. For reasons explained in issue #23218, only superuser roles are allowed to write to system tables - it is not enough for the role to be granted MODIFY permissions on the system table or on ALL KEYSPACES. Moreover, the ability to modify system tables carries special risks, so this patch only allows writes to the system tables if a new configuration option "alternator_allow_system_table_write" turned on. This option is turned off by default. This patch also includes a test for this new configuration-writing capability. The test scripts test/alternator/run and test.py now run Scylla with alternator_allow_system_table_write turned on, but the new test can also run without this option, and will be skipped in that case (to allow running the test suite against some manually- run instance of Scylla). Fixes: #12348 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 10:00:04 +03:00
Piotr Dulikowski	ec7832cc84	Merge 'Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Patryk Jędrzejczak The following steps are performed in sequence as part of the Raft-based recovery procedure: - set `recovery_leader` to the host ID of the recovery leader in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - perform a rolling restart (with the recovery leader being restarted first). These steps are not intuitive and more complicated than they could be. In this PR, we simplify these steps. From now on, we will be able to simply set `recovery_leader` on each node just before restarting it. Apart from making necessary changes in the code, we also update all tests of the Raft-based recovery procedure and the user-facing documentation. Fixes scylladb/scylladb#25015 The Raft-based procedure was added in 2025.2. This PR makes the procedure simpler and less error-prone, so it should be backported to 2025.2 and 2025.3. Closes scylladb/scylladb#25032 * github.com:scylladb/scylladb: docs: document the option to set recovery_leader later test: delay setting recovery_leader in the recovery procedure tests gossip: add recovery_leader to gossip_digest_syn db: system_keyspace: peers_table_read_fixup: remove rows with null host_id db/config, gms/gossiper: change recovery_leader to UUID db/config, utils: allow using UUID as a config option	2025-08-04 08:29:32 +02:00
Taras Veretilnyk	1d6808aec4	topology_coordinator: Make tablet_load_stats_refresh_interval configurable This commits introduces an config option 'tablet_load_stats_refresh_interval_in_seconds' that allows overriding the default value without using error injection. Fixes scylladb/scylladb#24641 Closes scylladb/scylladb#24746	2025-07-31 14:31:55 +03:00
Sergey Zolotukhin	ea311be12b	generic_server: Two-step connection shutdown. When shutting down in `generic_server`, connections are now closed in two steps. First, only the RX (receive) side is shut down. Then, after all ongoing requests are completed, or a timeout happened the connections are fully closed. Fixes scylladb/scylladb#24481	2025-07-28 10:08:06 +02:00
Ran Regev	db4f301f0c	scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec Fixes: #24758 Updated scylla.yaml and the help for scylla --help Closes scylladb/scylladb#24793	2025-07-25 10:45:32 +03:00
Patryk Jędrzejczak	445a15ff45	db/config, gms/gossiper: change recovery_leader to UUID We change the type of the `recovery_leader` config parameter and `gossip_config::recovery_leader` from sstring to UUID. `recovery_leader` is supposed to store host ID, so UUID is a natural choice. After changing the type to UUID, if the user provides an incorrect UUID, parsing `recovery_leader` will fail early, but the start-up will continue. Outside the recovery procedure, `recovery_leader` will then be ignored. In the recovery procedure, the start-up will fail on: ``` throw std::runtime_error( "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. " "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader."); ```	2025-07-23 15:36:56 +02:00
Patryk Jędrzejczak	ec69028907	db/config, utils: allow using UUID as a config option We change the `recovery_leader` option to UUID in the following commit.	2025-07-23 15:36:45 +02:00
Avi Kivity	e89f6c5586	config, main: make cpu scheduling mandatory CPU scheduling has been with us since `641aaba12c` (2017), and no one ever disables it. Likely nothing really works without it. Make it mandatory and mark the option unused. Closes scylladb/scylladb#24894	2025-07-22 12:39:01 +02:00
Pawel Pery	7bf53fc908	vector_store_client: implement initial vector_store_client service This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. It adds a `services/vector_store_client.{cc\|hh}` sharded service and a configuration parameter `vector_store_uri` with a `http://vector-store.dns.name:port` format. If there will be an error during parsing that parameter there will be an exception during construction. For the future unit testing purposes the patch adds `vector_store_client_tester` as a way to inject mockup functionality. This service will be used by the select statements for the Vector search indexes (see VS-46). For this reason I've added vector_store_client service in the query processor. Reference: VS-47 VS-45	2025-07-08 16:29:55 +02:00
Michał Chojnowski	7d26d3c7cb	db/config: add an option that disables dict-aware sstable compressors in DDL statements For reasons, we want to be able to disallow dictionary-aware compressors in chosen deployments. This patch adds a knob for that. When the knob is disabled, dictionary-aware compressors will be rejected in the validation stage of CREATE and ALTER statements. Closes scylladb/scylladb#24355	2025-06-09 13:30:40 +03:00
Avi Kivity	04fb2c026d	config: decrease default large allocation warning threshold to 128k Back in 2017 (`5a2439e702`), we introduced a check for large allocations as they can stall the memory allocator. The warning threshold was set at 1 MB. Since then many fixes for large allocations went in and it is now time to reduce the threshold further. We reduce it here to 128 kB, the natural allocation size for the system. A quick run showed no warnings. Closes scylladb/scylladb#23975	2025-05-05 12:13:48 +03:00
Kefu Chai	7254c0c515	db/config.cc: correct a typo in option's description s/incomming/incoming/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23826	2025-04-22 16:55:04 +03:00
Nadav Har'El	fbcf77d134	raft: make group0 Raft operation timeout configurable A recent commit `370707b111` (re)introduced a timeout for every group0 Raft operation. This timeout was set to 60 seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody". However, one of the things we do as a group0 operation is schema changes, and we already noticed a few years ago, see commit `0b2cf21932`, that in some extremely overloaded test machines where tests run hundreds of times (!) slower than usual, a single big schema operation - such as Alternator's DeleteTable deleting a table and multiple of its CDC or view tables - sometimes takes more than 60 seconds. The above fix changed the client's timeout to wait for 300 seconds instead of 60 seconds, but now we also need to increase our Raft timeout, or the server can time out. We've seen this happening recently making some tests flaky in CI (issue #23543). So let's make this timeout configurable, as a new configuration option group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e, 60 seconds), the same as the existing default. The test framework overrides this default with a a higher 300 second timeout, matching the client-side timeout. Before this patch, this timeout was already configurable in a strange way, using injections. But this was a misstep: We already have more than a dozen timeouts configurable through the normal configration, and this one should have been configured in the same way. There is nothing "holy" about the default of 60 seconds we chose, and who knows maybe in the future we might need to tweek it in the field, just like we made the other timeouts tweakable. Injections cannot be used in release mode, but configuration options can. Fixes #23543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23717	2025-04-15 10:57:39 +03:00
Avi Kivity	ed3e4f33fd	Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency. It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed. New connections_shed and connections_blocked metrics are added for tracking. Testing: - manually via simple program creating high number of connection and constantly re-connecting - added benchmark Following are benchmark results: Before: ``` > build/release/test/perf/perf_generic_server --smp=1 170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors) [...] throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54 instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96 cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15 ``` After: ``` > build/release/test/perf/perf_generic_server --smp=1 167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors) [...] throughput: mean= 171199.55 standard-deviation=2484.58 median= 171667.06 median-absolute-deviation=2087.63 maximum=173689.11 minimum=167904.76 instructions_per_op: mean= 4801.90 standard-deviation=16.54 median= 4796.78 median-absolute-deviation=9.32 maximum=4830.71 minimum=4789.81 cpu_cycles_per_op: mean= 3245.26 standard-deviation=32.28 median= 3230.44 median-absolute-deviation=16.52 maximum=3297.39 minimum=3215.62 ``` The patch adds around 67 insns/op so it's effect on performance should be negligible. Fixes: https://github.com/scylladb/scylladb/issues/22844 Closes scylladb/scylladb#22828 * github.com:scylladb/scylladb: transport: move on_connection_close into connection destructor test: perf: make aggregated_perf_results formatting more human readable transport: add blocked and shed connection metrics generic_server: throttle and shed incoming connections according to semaphore limit generic_server: add data source and sink wrappers bookkeeping network IO generic_server: coroutinize part of server::do_accepts test: add benchmark for generic_server test: perf: add option to count multiple ops per time_parallel iteration generic_server: add semaphore for limiting new connections concurrency generic_server: add config to the constructor generic_server: add on_connection_ready handler	2025-04-09 21:41:38 +03:00
Marcin Maliszkiewicz	ed82bede39	generic_server: add semaphore for limiting new connections concurrency It will be used in following commits.	2025-04-09 10:30:58 +02:00
Botond Dénes	fcdae20fd1	Merge 'Add tablet enforcing option' from Benny Halevy This series add a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing `enable_tablets` option. It can be set to the following values: disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option enabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option `tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether tablets are disabled or enabled by default for new keyspaces, respectively. In either cases, tablets can be opted-in or out using the `tablets={'enabled':...}` keyspace option, when the keyspace is created. `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}` Refs scylladb/scylla-enterprise#4355 * Requires backport to 2025.1 Closes scylladb/scylladb#22273 * github.com:scylladb/scylladb: boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option db/config: add tablets_mode_for_new_keyspaces option	2025-04-03 16:32:19 +03:00
Radosław Cybulski	c36614e16d	alternator: add size check to BatchItemWrite Add a size check for BatchItemWrite command - if the item count is bigger than configuration value `alternator_maximum_batch_write_size`, an error will be raised and no modification will happen. This is done to synchronize with DynamoDB, where maximum size of BatchItemWrite is 25. To avoid complaints from clients, who use our feature of BatchWriteItem being limitless we set default value to 100. Fixes #5057 Closes scylladb/scylladb#23232	2025-04-02 14:48:00 +03:00
Kefu Chai	6da758d74c	config: mark uuid_sstable_identifiers_enabled unused the option of `uuid_sstable_identifier_enabled` was introduced in `f014ccf3` . the first version which has this change was 5.4, and 6.1 has been branched. during the discussion of backup and restore, we realized that we've been taking efforts to address problems which could have been addressed with the sstable with UUID-based identifier. see also #10459 which is the issue which proposed to implement UUID-v1 based sstable identifier. now that two major releases passed, we should have the luxury to mark this option "unused". this option which was previously introduced to keep the backward compatibility, and to allow user to opt-out of the feature for some reasons. so in this change, mark the option unused, so that if any user still sets this option with command line, they will get a clear error. but we still parse and handle this setting in `scylla.yaml`, so that this option is still respected for existing settings, and for existing tests, which are not yet prepared for the uuid-based sstable identifiers. Refs #10459 Fixes #20337 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#20341	2025-04-01 20:21:47 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Michał Chojnowski	4f0d453acf	dict_autotrainer: introduce sstable_dict_autotrainer Add a fiber responsible for periodic re-training of compression dictionaries (for tables which opted into dict-aware compression). As of this patch, it works like this: every `$tick_period` (15 minutes), if we are the current Raft leader, we check for dict-aware tables which have no dict, or a dict older than `$retrain_period`. For those tables, if they have enough data (>1GiB) for a training, we train a new dict and check if it's significantly better than the current one (provides ratio smaller than 95% of current ratio), and if so, we update the dict.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Robert Bindar	e3a3508960	Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 13:39:39 +03:00
Benny Halevy	62aeba759b	tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:32:16 +02:00
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced" while will enable tablets by default for new keyspace but without the posibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Dawid Mędrek	32879ec0d5	db/config: Introduce RF-rack-valid keyspaces We introduce a new term in the glossary: RF-rack-valid keyspace. We also highlight in our user documentation that all keyspaces must remain RF-rack-valid throughout their lifetime, and failing to guarantee that may result in data inconsistencies or other issues. We base that information on our experience with materialized views in keyspaces using tablets, even though they remain an experimental feature. Along with the new term, we introduce a new configuration option called `rf_rack_valid_keyspaces`, which, when enabled, will enforce preserving all keyspaces RF-rack-valid. That functionality will be implemented in upcoming commits. For now, we materialize the restriction in form of a named requirement: a function verifying that the passed keyspace is RF-rack-valid. The option is disabled by default. That will change once we adjust the existing tests to the new semantics. Once that is done, the option will first be enabled by default, and then it will be removed. Fixes scylladb/scylladb#20356	2025-03-19 14:46:35 +01:00
Patryk Jędrzejczak	9970c1fcc3	gossip: allow group 0 ID mismatch in the Raft-based recovery procedure This patch ensures that members of the new group 0 can gossip with members of the old group 0 during rolling restart in the Raft-based recovery procedure. Without this change, restarted nodes (members of the new group 0) wouldn't be marked as UP by other nodes (members of the old group 0), which would decrease availability.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	fd51d7e448	treewide: allow recreating group 0 in the Raft-based recovery procedure This patch adds support for recreating group 0 after losing majority. This is the only part of the new Raft-based recovery procedure that touches Scylla core. The following steps are necessary to recreate group 0: 1. Determine the new group 0 members. These are alive nodes that are normal or rebuilding. 2. Choose the recovery leader - the node which will become the new group 0 leader. This must be one of the nodes with the latest persistent group 0 state. 3. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 4. Set the new scylla.yaml parameter - `recovery_leader` - to Host ID of the recovery leader on each live node. 5. Rolling restart all live nodes, but the recovery leader must be restarted first. In the implementation, restarts in step 5 are very similar to normal restarts with the Raft-based topology enabled. The only differences are: 1. Steps 3-4 make the restarting node discover the new group 0 in `join_cluster`. 2. The group 0 server is started in `join_group0`, not `setup_group0_if_exists`. 3. The restarting node joins the new group 0 in `join_topology` using `legacy_handshaker`. There is no reason to contact the topology coordinator since the node has already joined the topology. Unfortunately, this patch creates another execution path for the starting logic. `join_cluster` becomes even messier. However, there is nothing we can do about it. Joining group 0 without joining topology is something completely new. Having a few small changes without touching other execution paths is the best we can do. We will start removing the old stuff soon, after making the Raft-based topology mandatory, and the situation will improve.	2025-03-14 13:52:57 +01:00
Tomasz Grabiec	d01cc16d1e	config, disk_space_monitor: Allow overriding capacity via config Intended for testing, or hot-fixing out-of-space issues in production. Tablet load balancer uses this information for determining per-shard load so reducing capacity will cause tablets to be migrated away from the node.	2025-03-06 13:35:37 +01:00
Botond Dénes	71d8b7aa9f	querier: demote tombstone warning for range-scans to debug level Range scans are expected to go though lots of tombstones, no need to spam the logs about this. The tombstone warning log is demoted to debug level, if somebody wants to see it they can bump the logger to debug level. Fixes: https://github.com/scylladb/scylladb/issues/23093 Closes scylladb/scylladb#23094	2025-03-04 10:38:06 +03:00
Calle Wilund	2f10205714	config: Enable optional TLS1.3 session ticket usage in cert setup Refs #22916 Adds an "enable_session_tickets" option to TLS setup for our server endpoints (not documented for internode RPC, as we don't handle it on the client side there), allowing enabling of TLS3 client session ticket, i.e. quicker reconnect. Session tickets are valid within a time frame or until a node restarts, whichever comes first. v2: Use "TLS1.3" in help message Closes scylladb/scylladb#22928	2025-03-04 09:30:53 +02:00
Botond Dénes	5d63ef4d15	Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund Fixes #22314 Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them. Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points. Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line. Closes scylladb/scylladb#22327 * github.com:scylladb/scylladb: tools: Add standard extensions and propagate to schema load cql_test_env: Use add all extensions instead of inidividually main: Move extensions adding to function tomstone_gc: Make validate work for tools	2025-02-26 13:52:47 +02:00
Kefu Chai	9fdbe0e74b	tree: Remove unused boost headers This commit eliminates unused boost header includes from the tree. Removing these unnecessary includes reduces dependencies on the external Boost.Adapters library, leading to faster compile times and a slightly cleaner codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22997	2025-02-25 10:32:32 +03:00
Tomasz Grabiec	3d01ce3707	config: Make tablets_initial_scale_factor live-updateable	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	f1bda8d4c1	tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal The limit is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. The limit is configurable. It's a global per-cluster config which controls how many tablet replicas per shard in total we consider to be still ok. It controls tablet allocator behavior, when choosing initial tablet count. Even though it's a per-node config, we don't support different limits per node. All nodes must have the same value of that config. It's similar in that regard to other scheduler config items like tablets_initial_scale_factor and target_tablet_size_in_bytes.	2025-02-19 16:29:07 +01:00
Tomasz Grabiec	f043c83ba5	tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard Currently the scale is applied post rounding up of tablet count so that tablet count per shard is at least 1. In order to be able to use the scale to increase tablet count per shard, we need to apply it prior to division by RF, otherwise we will overshoot per-shard tablet replica count. Example: 4 nodes, -c1, rf=3, initial_tablets_scale=10 Before: initial_tablet_count=20, tablet-per-shard=15 After: initial_tablet_count=14, tablets-per-shard=10.5	2025-02-19 14:38:50 +01:00
Tomasz Grabiec	2463e524ed	tablets: Set default initial tablet count scale to 10 This will result in new tables having at least 10 tablet replicas per shard by default. We want this to reduce tablet load imbalance due to differences in tablet count per shard, where some shards have 1 tablet and some shards have 2 tablets. With higher tablet count per shard, this difference-by-one is less relevant. Fixes #21967 In some tests, we explicity set the initial scale to 1 as some of the existing tests assume 1 compaction group per shard. test.py uses a lower default. Having many tablets per shard slows down certain topology operations like decommission/replace/removenode, where the running time is proportional to tablet count, not data size, because constant cost (latency) of migration dominates. This latency is due to group0 operations and barriers. This is especially pronounced in debug mode. Scheduler allows at most 2 migrations per shard, so this latency becomes a determining factor for decommission speed. To avoid this problem in tests, we use lower default for tablet count per shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we could compensate by allowing more concurrency when migrating small tablets, but there's no infrastructure for that yet. I observed that with 10 tablets per shard, debug-mode topology_custom.mv/test_mv_topology_change starts to time-out during removenode (30 s).	2025-02-19 14:38:50 +01:00
Botond Dénes	f808f84a45	db/config: improve description of repair_multishard_reader_enable_read_ahead The current description has a typo and in general not informative enough on when this option should be used. Closes scylladb/scylladb#21758	2025-02-11 22:16:09 +02:00
Botond Dénes	3d12451d1f	db/config: reader_concurrency_semaphore_cpu_concurrency: bump default to 2 This config item controls how many CPU-bound reads are allowed to run in parallel. The effective concurrency of a single CPU core is 1, so allowing more than one CPU-bound reads to run concurrently will just result in time-sharing and both reads having higher latency. However, restricting concurrency to 1 means that a CPU bound read that takes a lot of time to complete can block other quick reads while it is running. Increase this default setting to 2 as a compromise between not over-using time-sharing, while not allowing such slow reads to block the queue behind them. Fixes: #22450 Closes scylladb/scylladb#22679	2025-02-05 21:52:20 +02:00
Pavel Emelyanov	e47c7d5255	Merge 'config: Improve internode_compression option validation and documentation' from Kefu Chai This PR enhances the internode_compression configuration option in two ways: 1. Add validation for option values Previously, we silently defaulted to 'none' when given invalid values. Now we explicitly validate against the three supported values (all, dc, none) and reject invalid inputs. This provides better error messages when users misconfigure the option. 2. Fix documentation rendering The help text for this option previously used C++ escape sequences which rendered incorrectly in Sphinx-generated HTML. We now use bullet points with '' prefix to list the available values, matching our documentation style for other config options. This ensures consistent rendering in both CLI and HTML outputs. Note: The current documentation format puts type/default/liveness information in the same bullet list as option values. This affects other config options as well and will need to be addressed in a separate change. --- this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport. Closes scylladb/scylladb#22548 github.com:scylladb/scylladb: config: validate internode_compression option values config: start available options with '*'	2025-02-04 10:17:23 +03:00
Kefu Chai	8d0cabb392	config: validate internode_compression option values Previously, the internode_compression option silently defaulted to 'none' for any unrecognized value instead of validating input. It only compared against 'all' and 'dc', making it error-prone. Add explicit validation for the three supported values: - all - dc - none This ensures invalid values are rejected both in command line and YAML configuration, providing better error messages to users. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-29 14:52:35 +08:00
Kefu Chai	a81416862a	config: start available options with '' use '' prefix for config option values instead of escape sequences The custom Sphinx extension that generates documentation from config.cc help messages has issues with C++ escape sequences. For example, "\tall: All traffic" renders incorrectly as "tall: All traffic" in HTML output. Instead of using escape sequences, switch to bullet-point style with '*' prefix which works better in both CLI and HTML rendering. This matches our existing documentation style for available option values in other configs. Note: This change puts type/default/liveness info in the same bullet list as option values. This limitation affects other similar config options and will need to be addressed comprehensively in a future change. Refs scylladb/scylladb#22423 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-29 14:52:35 +08:00
Andrzej Jackowski	5651cc49ed	audit: make categories, tables, and keyspaces liveupdatable This change: - Set liveness::LiveUpdate for audit_categories, audit_tables, and audit_keyspaces - Keep const reference to db::config in audit, so current config values can be obtained by audit implementation - Implement function audit::update_config to parse given string, update audit datastructures when needed, and log the changes. - Add observers to call audit::update_config when categories, tables, or keyspaces configuration changes Fixes scylladb/scylla-enterprise#1789	2025-01-27 11:37:13 +01:00
Kefu Chai	769162de91	tree: correct misspellings these misspellings were identified by codespell. let's fix them. one of them is a part of a user visble string. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22443	2025-01-26 15:54:06 +02:00
Kefu Chai	d1c222d9bd	config: specialize config_from_string() for sstring Specialize config_from_string() for sstring to resolve lexical_cast stream state parsing limitation. This enables correct handling of empty string configurations, such as setting an empty value in CQL: ```cql UPDATE system.config SET value='' WHERE name='allowed_repair_based_node_ops'; ``` Previous implementation using boost::lexical_cast would fail due to EOF stream state, incorrectly rejecting valid empty string conversions. Fixes scylladb/scylladb#22491 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22492	2025-01-26 15:53:12 +02:00
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Nadav Har'El	3e16b80014	Merge 'Reject create table with compact storage' from Benny Halevy As discussed in https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813, compact storage tables are deprecated. Yet, there's is nothing in the code that prevents users from creating such tables. This patch adds a live-updateable config option: `enable_create_table_with_compact_storage`, set to `false` by default, that require users to opt-in in order to create new tables WITH COMPACT STORAGE. Refs scylladb/scylladb#12263, scylladb/scylladb#16375 * Since this guardrail is an enhancement, no backport is needed Closes scylladb/scylladb#16403 * github.com:scylladb/scylladb: docs: ddl: document the deprecation of compact tables test: enable_create_table_with_compact_storage for tests that need it config: add enable_create_table_with_compact_storage	2025-01-20 22:02:02 +02:00
Asias He	8208688178	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22034	2025-01-20 16:43:21 +02:00

1 2 3 4 5 ...

426 Commits