scylladb

Author	SHA1	Message	Date
Patryk Jędrzejczak	73db5c94de	Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski `system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connections (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`. This patch series contains: - Introduction of `CLIENT_ROUTES` feature flag. - Implementation of raft-based `system.client_routes` table - Implementation of `v2/client-routes` POST/DELETE/GET endpoints - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed - New tests that verify the aforementioned features Ref: scylladb/scylla-enterprise#5699 For now, no automatic backport. However, the changes are planned to be released in `2025.4` either as a backport or a private build. Closes scylladb/scylladb#27323 * https://github.com/scylladb/scylladb: docs: describe CLIENT_ROUTES_CHANGE extension test: add test for CLIENT_ROUTES event service: transport: add CLIENT_ROUTES_CHANGE event test: add cluster tests for client routes test: add API tests for client_routes endpoints test: add `timeout` parameter to `delete` in RESTClient test: allow json_body in send api: implement client_routes endpoints api: add client_routes.json service: main: add client_routes_service db: add system.client_routes table gms: add CLIENT_ROUTES feature	2025-12-16 10:38:27 +01:00
Andrzej Jackowski	c2b1b10ca0	service: transport: add CLIENT_ROUTES_CHANGE event Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh connections when `system.client_routes` is modified. Some deployments (e.g., Private Link) require specific address/port mappings that can change without topology changes and drivers need to adapt promptly to avoid connectivity issues. This new EVENT type carries a change indicator plus the affected `connection_ids` and `host_ids`. The only change value is `UPDATE_NODES`, meaning one or more client routes were inserted, updated, or deleted. Drivers subscribe using the existing events mechanism, so no additional `cql_protocol_extension` key is required. Ref: scylladb/scylla-enterprise#5699	2025-12-15 18:19:37 +01:00
Andrzej Jackowski	e153cc434f	api: implement client_routes endpoints Ref: scylladb/scylla-enterprise#5699	2025-12-15 17:36:47 +01:00
Andrzej Jackowski	6fcc1ecf94	service: main: add client_routes_service Introduce `client_routes_service` for managing `system.client_routes` table. Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:13:40 +01:00
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short-lived: they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these that is older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible-to-bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after it are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the number of tombstones. 
Instead, restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriad individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is implemented by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
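The goals of the `system.batchlog_v2` layout described in the merge above can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not ScyllaDB code: the `Partition` class, its method names, and the integer timestamps are hypothetical. It only shows how rows clustered by `(written_at, id)` allow selecting old-enough batches with a clustering-range slice and then shadowing them with a single range deletion.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy stand-in for one (version, stage, shard) partition (hypothetical)."""

    # Rows are kept sorted by the clustering key (written_at, id).
    rows: list = field(default_factory=list)

    def insert(self, written_at: int, batch_id: str) -> None:
        bisect.insort(self.rows, (written_at, batch_id))

    def older_than(self, cutoff: int) -> list:
        # Clustering-range slice: every row with written_at < cutoff.
        # No post-filtering is needed because rows are ordered by time.
        end = bisect.bisect_left(self.rows, (cutoff,))
        return self.rows[:end]

    def range_delete(self, cutoff: int) -> int:
        # One range deletion covers every row below the cutoff, instead of
        # one tombstone per batch. Returns the number of rows removed.
        end = bisect.bisect_left(self.rows, (cutoff,))
        del self.rows[:end]
        return end


p = Partition()
for ts, bid in [(30, "c"), (10, "a"), (20, "b")]:
    p.insert(ts, bid)
replayable = p.older_than(25)  # batches old enough to replay
removed = p.range_delete(25)   # single cleanup after the replay
```

In the real table the cleanup would be a range tombstone written to the partition; the list deletion here only stands in for its shadowing effect.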
Andrzej Jackowski	5afcec4a3d	Revert "auth: move passwords::check call to alien thread" The alien thread was a solution for reactor stalls caused by indivisible password‑hashing tasks (scylladb/scylladb#24524). However, because there is only one alien thread, overall hashing throughput was reduced (see, e.g., scylladb/scylla-enterprise#5711). To address this, the alien‑thread solution is reverted, and a hashing implementation with yielding will be introduced later in this patch series. This reverts commit `9574513ec1`.	2025-12-10 15:36:09 +01:00
Avi Kivity	d811eeb4ca	Merge 'Make direct failure detector verb handler more efficient' from Gleb Natapov We saw that in large clusters the direct failure detector may cause large task queues to accumulate. The series addresses this issue and also moves the code into the correct scheduling group. Fixes https://github.com/scylladb/scylladb/issues/27142 Backport to all versions to which `60f1053087` was backported, since it should improve performance in large clusters. Closes scylladb/scylladb#27387 * github.com:scylladb/scylladb: direct_failure_detector: run direct failure detector in the gossiper scheduling group raft: drop invoke_on from the pinger verb handler direct_failure_detector: pass timeout to direct_fd_ping verb	2025-12-07 11:40:26 +02:00
Tomasz Grabiec	d4014b7970	Drop legacy schema support We switched to using v3 schema tables (in system_schema keyspace) in 2017, in `9eb91bc30b`. So no system should have the old schema any more. No need to run legacy_schema_migrator on boot. Closes scylladb/scylladb#27420	2025-12-07 00:09:13 +02:00
Tomasz Grabiec	e54abde3e8	Merge 'main: delay setup of storage_service REST API' from Andrzej Jackowski The storage_service REST API uses `group0` internally. Before this patch, it was possible to send an HTTP request before `group0` was initialized, which resulted in a segmentation fault. Therefore, this patch delays the setup of the storage_service REST API. Additionally, `test_rest_api_on_startup` is added to reproduce the problem. Fixes: https://github.com/scylladb/scylladb/issues/27130 No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start. Closes scylladb/scylladb#27410 * github.com:scylladb/scylladb: test: add test_rest_api_on_startup main: delay setup of storage_service REST API	2025-12-04 14:56:49 +01:00
Gleb Natapov	86dde50c0d	direct_failure_detector: run direct failure detector in the gossiper scheduling group When the direct failure detector was introduced, the idea was that it would run on the same connection that raft group0 verbs run on, but in `60f1053087` raft verbs were moved to run on the gossiper connection while DIRECT_FD_PING was left where it was. This patch moves it to the gossiper connection as well and fixes the pinger code to run in the gossiper scheduling group.	2025-12-04 11:35:43 +02:00
Avi Kivity	b82f92b439	main: replace p11-kit hack for trust paths override with gnutls hack p11-kit has hardcoded paths for the trust paths. Of course, each Linux distribution hardcodes those paths differently. As a result, our relocatable gnutls, which uses p11-kit-trust.so to process the trust paths, needs some overrides to select the right paths. Currently, we use p11_kit_override_system_files(), a p11-kit API intended for testing, but which worked well enough for our purpose, to override the trust module configuration. Unfortunately, starting (presumably [1]) in gnutls 3.8.11, gnutls changed how it works with p11-kit and our override is now ignored. This was likely unintentional, but there appears to be a better way: instead of letting gnutls auto-load the trust module from a hacked configuration, we load the modules ourselves using gnutls_pkcs11_init(GNUTLS_PKCS11_FLAG_MANUAL) and gnutls_pkcs11_add_provider(). These appear to be intended for the purpose. We communicate the paths to the scylla executable using an environment variable. This isn't optimal, but is much easier than adding a command line variable since there are multiple levels of command line parsing due to the subtool mechanism. With this, we unlock the possibility to upgrade gnutls to newer versions. [1] `aa5f15a872` Closes scylladb/scylladb#27348	2025-12-04 11:33:51 +02:00
Botond Dénes	b9199e8b24	Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz Scylla currently has bad resiliency to connection storms. Unbounded concurrency in opening new connections on the client side can easily overload nodes or hurt their latency. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods. To improve resiliency we are introducing a unified, auth-specialized cache to the system. This patch series is stage 1, where the cache is used only on the login path. Dependency diagram: ``` \|Authentication Layer\| \| v +--------------------------------+ \| Auth Cache \| +--------------------------------+ ^ \| \| \| \| v \|Raft Write Logic \| \| CQL Read Layer\| ``` Cache invalidation is based on raft and the cache contains the full content of the related tables. The LDAP role manager may benefit partially, as the can_login function is common and will be cached, but it still needs to query roles from an external source. Performance results: For single shard connection/disconnection scenario insns/conn decreased by 5%, allocs/conn decreased by 23%, tasks/conn decreased by 20%. Results for 20 shards are very similar. 
Raw data before: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1128.55 tps (599.2 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op, 0 errors) 1157.41 tps (601.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op, 0 errors) 1167.42 tps (603.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op, 0 errors) 1159.63 tps (605.9 allocs/op, 0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op, 0 errors) 1165.12 tps (608.8 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op, 0 errors) throughput: mean= 1155.63 standard-deviation=15.66 median= 1159.63 median-absolute-deviation=9.49 maximum=1167.42 minimum=1128.55 instructions_per_op: mean= 2602934.31 standard-deviation=16063.01 median= 2603234.19 median-absolute-deviation=13887.96 maximum=2625804.05 minimum=2586609.82 cpu_cycles_per_op: mean= 1359576.30 standard-deviation=5945.69 median= 1360607.05 median-absolute-deviation=4358.94 maximum=1365736.42 minimum=1350912.10 ``` Raw data after: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1132.09 tps (457.5 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op, 0 errors) 1157.70 tps (458.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 
1283768 cycles/op, 0 errors) 1162.86 tps (459.0 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op, 0 errors) 1153.15 tps (460.2 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op, 0 errors) 1142.09 tps (460.6 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op, 0 errors) 1124.89 tps (462.5 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op, 0 errors) 1156.75 tps (464.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op, 0 errors) 1152.16 tps (466.3 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op, 0 errors) 1154.77 tps (469.8 allocs/op, 0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op, 0 errors) 1152.22 tps (472.4 allocs/op, 0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op, 0 errors) throughput: mean= 1148.87 standard-deviation=12.08 median= 1153.15 median-absolute-deviation=7.88 maximum=1162.86 minimum=1124.89 instructions_per_op: mean= 2487755.88 standard-deviation=43838.23 median= 2478900.02 median-absolute-deviation=24531.06 maximum=2571954.26 minimum=2432485.38 cpu_cycles_per_op: mean= 1304144.76 standard-deviation=22129.55 median= 1305025.71 median-absolute-deviation=12363.25 maximum=1345341.16 minimum=1270655.17 ``` Fixes https://github.com/scylladb/scylladb/issues/18891 Backport: no, it's a new feature Closes scylladb/scylladb#26841 * github.com:scylladb/scylladb: auth: use auth cache on login path auth: corutinize standard_role_manager::can_login main: auth: add auth cache dependency to auth service raft: update auth cache when data changes auth: storage_service: reload auth cache on v1 to v2 auth migration raft: reload auth cache on snapshot application service: add auth cache getter to storage service main: start auth cache service auth: add unified cache implementation auth: move table names to common.hh	2025-12-03 16:45:01 +02:00
Andrzej Jackowski	3b70154f0a	main: delay setup of storage_service REST API The storage_service REST API uses `group0` internally. Before this patch, it was possible to send an HTTP request before `group0` was initialized, which resulted in a segmentation fault. Therefore, this patch delays the setup of the storage_service REST API. Fixes: scylladb/scylladb#27130	2025-12-03 15:35:54 +01:00
Botond Dénes	e309b5dbe1	db/batchlog_manager: config: s/write_timeout/reply_timeot/ Although the value of this item is indeed derived from the write timeout config, the name doesn't reflect what it is used for. Change it to reflect it better.	2025-12-02 14:21:26 +02:00
Marcin Maliszkiewicz	b29c42adce	main: auth: add auth cache dependency to auth service In the following commit we'll switch some authorizer and role manager code to use the cache so we're preparing the dependency.	2025-11-26 12:01:31 +01:00
Marcin Maliszkiewicz	2cf1ca43b5	service: add auth cache getter to storage service Prepare for use in a subsequent commit in group0_state_machine, where the auth cache will be integrated. This follows the same pattern as updates to the service-level cache, view-building state, and CDC streams.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	642f468c59	main: start auth cache service The service is not yet used anywhere, we first build scaffolding.	2025-11-26 12:00:50 +01:00
Nadav Har'El	9cde93e3da	Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability. With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead. In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments. Fixes https://github.com/scylladb/scylladb/issues/26311 This patch needs to be backported to 2025.4. Closes scylladb/scylladb#26897 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: update the docs after recent changes db/view/view_building: send coordinator's term in the RPC db/view/view_building_state: replace task's state with `aborted` flag db/view/view_building_coordinator: batch finished tasks reporting db/view/view_building_worker: change internal implementation db/view/view_building_coordinator: change `work_on_tasks` RPC return type	2025-11-26 11:35:44 +02:00
Botond Dénes	384bffb8da	Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho This PR adds support for limiting the maximum shares allocated to a compaction scheduling class by the compaction controller. It introduces a new configuration parameter, compaction_max_shares, which, when set to a non zero value, will cap the shares allocated to compaction jobs. This PR also exposes the shares computed by the compaction controller via metrics, for observability purposes. Fixes https://github.com/scylladb/scylladb/issues/9431 Enhancement. No need to backport. NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696 Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50) Closes scylladb/scylladb#27024 * github.com:scylladb/scylladb: db/config: introduce new config parameter `compaction_max_shares` compaction_manager:config: introduce max_shares compaction_controller: add configurable maximum shares compaction_controller: introduce `set_max_shares()`	2025-11-26 06:51:30 +02:00
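The capping rule described for `compaction_max_shares` in the merge above can be sketched in a few lines. This is a minimal illustration, not the compaction controller's implementation: the function name `cap_shares` is hypothetical, and the only behaviour taken from the commit message is that a non-zero value caps the computed shares while zero leaves them unlimited.

```python
def cap_shares(computed_shares: float, max_shares: float) -> float:
    """Return the shares to apply (hypothetical helper).

    A max_shares of 0 means "no cap", mirroring the description of
    compaction_max_shares as taking effect only when set to a non-zero value.
    """
    if max_shares > 0:
        return min(computed_shares, max_shares)
    return computed_shares


# The backlog asks for 1000 shares; a cap of 500 limits it, 0 does not.
capped = cap_shares(1000, 500)
uncapped = cap_shares(1000, 0)
```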
Michał Jadwiszczak	fb8cbf1615	db/view/view_building: send coordinator's term in the RPC To avoid the case where an old coordinator (which hasn't been stopped yet) dictates what should be done, add the raft term to the `work_on_view_building_tasks` RPC. The worker needs to check whether the term matches the current term from the raft server, and deny the request when the term is bad.	2025-11-25 12:14:05 +01:00
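The term check described in the commit above can be sketched as follows. This is an illustration under assumptions, not the worker's actual code: the `Worker` class, `StaleTermError`, and the `current_term` callable are hypothetical names, and only the rule "deny the request when the term does not match the raft server's current term" comes from the commit message.

```python
from typing import Callable, Iterable, List


class StaleTermError(Exception):
    """Raised when a request carries a term other than the current one."""


class Worker:
    def __init__(self, current_term: Callable[[], int]) -> None:
        # current_term stands in for asking the local raft server (assumption).
        self._current_term = current_term

    def work_on_tasks(self, coordinator_term: int, task_ids: Iterable[int]) -> List[int]:
        # Deny requests from a coordinator whose term does not match ours,
        # e.g. an old coordinator that has not been stopped yet.
        if coordinator_term != self._current_term():
            raise StaleTermError(
                f"term {coordinator_term} != current {self._current_term()}"
            )
        return list(task_ids)  # pretend every task was accepted


worker = Worker(current_term=lambda: 7)
accepted = worker.work_on_tasks(7, [1, 2, 3])
try:
    worker.work_on_tasks(6, [4])
    denied = False
except StaleTermError:
    denied = True
```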
Lakshmi Narayanan Sreethar	9cb766f929	db/config: introduce new config parameter `compaction_max_shares` Add support for the new configuration parameter `compaction_max_shares`, and update the compaction manager to pass it down to the compaction controller when it changes. The shares allocated to compaction jobs will be limited by this new parameter. Fixes #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 12:52:29 -03:00
Lakshmi Narayanan Sreethar	468b800e89	compaction_manager:config: introduce max_shares Introduce an updateable value `max_shares` to compaction manager's config. Also add a method `update_max_shares()` that applies the latest `max_shares` value to the compaction controller’s `max_shares`. This new variable will be connected to a config parameter in the next patch. Refs #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 11:43:38 -03:00
Calle Wilund	3c4546d839	messaging_service: Add internode_compression=rack as option Fixes #27085 Adds a "rack" option to enum/config and handles it in connection setup in messaging_service. Closes scylladb/scylladb#27099	2025-11-21 11:50:55 +02:00
Botond Dénes	d54d409a52	Merge 'audit: write out to both table and syslog' from Dario Mirovic This patch adds support for multiple audit log outputs. If only one audit log output is enabled, the behavior does not change. If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection of `storage_helper` objects. Performance testing shows that read query throughput and auth request throughput are consistent even at high reactor utilization. It can also be observed that read query latency increases a bit. Read query ops = 60k/s AUTH ops = 200/s \| Audit Mode \| QUERY latency (p99) \| Δ% vs none \| \|------------\|---------------------\|------------\| \| none \| 777 \| 0 \| \|table\| 801 \| +3.09% \| \|syslog \| 803 \| +3.35% \| \|table,syslog \| 818 \| +5.28% \| Read query ops = 50k/s AUTH ops = 200/s \| Audit Mode \| QUERY latency (p99) \| Δ% vs none \| \|------------\|---------------------\|------------\| \| none \| 643 \| 0 \| \|table\| 647 \| +0.62% \| \|syslog \| 648 \| +0.78% \| \|table,syslog \| 656 \| +2.02% \| Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) Fixes #26022 Backport: The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it. Closes scylladb/scylladb#26613 * github.com:scylladb/scylladb: test: dtest: audit_test.py: add AuditBackendComposite test: dtest: audit_test.py: group logs in dict per audit mode audit: write out to both table and syslog audit: move storage helper creation from `audit::start` to `audit::audit` audit: fix formatting in `audit::start_audit` audit: unify `create_audit` and `start_audit`	2025-11-17 15:04:15 +02:00
Marcin Maliszkiewicz	958d04c349	service: attach storage_service to migration_manager using pluggable Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local(), so it expects that storage service is not stopped before it stops. To solve this we use a permit approach. During storage_service stop: - we ignore new code execution in migration_manager that would use storage_service - but delay storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734	2025-11-14 08:50:19 +01:00
Dario Mirovic	549e6307ec	audit: unify `create_audit` and `start_audit` There is no need to have `create_audit` separate from `start_audit`. `create_audit` just stores the passed parameters, while `start_audit` does the actual initialization and startup work. Refs #26022	2025-11-06 03:05:06 +01:00
Nikos Dragazis	2fc812a1b9	db/config: Change default SSTable compressor to LZ4WithDictsCompressor `sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is `LZ4Compressor` (inherited from Cassandra). Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios. If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-30 15:53:49 +02:00
Nikos Dragazis	96e727d7b9	db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl The option is a knob that allows rejecting dictionary-aware compressors in the validation stage of CREATE/ALTER statements, and in the validation of `sstable_compression_user_table_options`. It was introduced in `7d26d3c7cb` to allow the admins of Scylla Cloud to selectively enable it in certain clusters. For more details, check: https://github.com/scylladb/scylla-enterprise/issues/5435 As of this series, we want to start offering dictionary compression as the default option in all clusters, i.e., treat it as a generally available feature. This makes the knob redundant. Additionally, making dictionary compression the default choice in `sstable_compression_user_table_options` creates an awkward dependency with the knob (disabling the knob should cause `sstable_compression_user_table_options` to fall back to a non-dict compressor as default). That may not be very clear to the end user. For these reasons, mark the option as "Deprecated", remove all relevant tests, and adjust the business logic as if dictionary compression is always available. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-29 20:13:08 +02:00
Radosław Cybulski	621e88ce52	Fix spelling errors Closes scylladb/scylladb#26652	2025-10-22 16:46:31 +02:00
Botond Dénes	c543059f86	Merge 'Synchronize tablet split and load-and-stream' from Raphael Raph Carvalho Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes the barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements the first fix. By waiting for the topology to quiesce, we guarantee that load-and-stream only starts when there's no chance the coordinator is handling some topology operation like split finalization. Fixes https://github.com/scylladb/scylladb/issues/26455. Closes scylladb/scylladb#26456 * github.com:scylladb/scylladb: test: Add reproducer for l-a-s and split synchronization issue sstables_loader: Synchronize tablet split and load-and-stream	2025-10-21 09:43:38 +03:00
Raphael S. Carvalho	3abc66da5a	sstables_loader: Synchronize tablet split and load-and-stream Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes the barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements the first fix. By waiting for the topology to quiesce, we guarantee that load-and-stream only starts when there's no chance the coordinator is handling some topology operation like split finalization. Fixes #26455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-10-20 19:17:22 -03:00
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow 
testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increase utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS clients (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Marcin Maliszkiewicz	389afcdeb6	service: fix dependencies during migration_manager startup We need to avoid reloading schema early as it goes via schema_applier which internally depends on storage_service and on distributed_loader initializing all keyspaces. Simply moving migration manager startup later in the code is not easy as some services depend on it being initialized so we just enable those feature listeners a bit later.	2025-10-14 10:56:26 +02:00
Calle Wilund	5d4558df3b	sstables: Use object_storage_client for remote storage Replaces direct s3 interfaces with the abstraction layer, and opens the way for multiple implementations/backends	2025-10-13 08:53:25 +00:00
Piotr Dulikowski	0b800aab17	Merge 'db/view/view_building_worker: move `discover_existing_staging_sstables()` to the foreground' from Michał Jadwiszczak db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to be run only once during the startup to collect existing staging sstables, so there is no need to do it in the background. This change will increase debuggability of any further issues related to it (like https://github.com/scylladb/scylladb/issues/26403). Fixes https://github.com/scylladb/scylladb/issues/26417 The patch should be backported to 2025.4 Closes scylladb/scylladb#26446 * github.com:scylladb/scylladb: db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground db/view/view_building_worker: futurize and rename `start_background_fibers()`	2025-10-09 18:24:50 +02:00
Michał Jadwiszczak	575dce765e	db/view/view_building_worker: futurize and rename `start_background_fibers()` Next commit will move `discover_existing_staging_sstables()` to the foreground, so to prepare for this we need to futurize `start_background_fibers()` method and change its name to better reflect its purpose.	2025-10-08 10:19:41 +02:00
Botond Dénes	8b0bfb817e	Merge 'Switch REST API server to use content-streaming' from Pavel Emelyanov Seastar httpd recommended users to stop using contiguous request.content string and read the body they need from request's input_stream instead. However, "official" deprecation of request content had been only made recently. This PR patches REST API server to turn this feature on and patches a few handlers that mess with request bodies to read them from request stream. Using newer seastar API, no need to backport Closes scylladb/scylladb#26418 * github.com:scylladb/scylladb: api: Switch to request content streaming api: Fix indentation after previous patch api: Coroutinize set_relabel_config handler api: Coroutinize set_error_injection handler	2025-10-07 14:13:47 +03:00
Botond Dénes	8beea931be	Merge 'Remove system_keyspace from column_family API' from Pavel Emelyanov This dependency reference is carried into column_family handlers block to make get_built_views handler work. However, the handler in question should live in view_builder block, because it works with v.b. data. This PR moves the handler there, while at it, coroutinizes it, and removes the no longer needed sys.ks. reference from column_family. API dependencies cleanup work, no need to backport Closes scylladb/scylladb#26381 * github.com:scylladb/scylladb: api: Fix indentation after previous patch api: Coroutinize get_built_indexes handler code api: Remove system_keyspace ref from column_family API block api: Move get_built_indexes from column_family to view_builder	2025-10-07 13:07:46 +03:00
Pavel Emelyanov	127afd4da1	api: Switch to request content streaming There are three handlers that need to be patched all at once with the server itself being marked with set_content_streaming. For two simple handlers just get the content string with read_entire_stream_contiguous helper. This is what httpd server did anyway. The "start_restore" handler used the contiguous contents to parse json from using rjson utility. This handler is patched to use read_entire_stream() that returns a vector of temporary buffers. The rjson parser has a helper to parse from that vector, so the change is also an optimization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:43:26 +03:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Pavel Emelyanov	f77f9db96c	api: Remove system_keyspace ref from column_family API block This reference was only needed to facilitate get_built_indexes handler to work. Now it's gone and the sys.ks. reference is no longer needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:50:22 +03:00
Dawid Mędrek	288be6c82d	db/view: Verify valid configuration for tablet-based views Creating a materialized view or a secondary index in a tablet-based keyspace requires that the user enabled two options: * experimental feature `views-with-tablets`, * configuration option `rf_rack_valid_keyspaces`. Because the latter has only become a necessity recently (in this series), it's possible that there are already existing materialized views that violate it. We add a new check at start-up that iterates over existing views and makes sure that that is not the case. Otherwise, Scylla notifies the user of the problem.	2025-10-01 09:01:53 +02:00
Nadav Har'El	926089746b	message: move RPC compression from utils/ to message/ The directory utils/ is supposed to contain general-purpose utility classes and functions, which are either already used across the project, or are designed to be used across the project. This patch moves 8 files out of utils/: utils/advanced_rpc_compressor.hh utils/advanced_rpc_compressor.cc utils/advanced_rpc_compressor_protocol.hh utils/stream_compressor.hh utils/stream_compressor.cc utils/dict_trainer.cc utils/dict_trainer.hh utils/shared_dict.hh These 8 files together implement the compression feature of RPC. None of them are used by any other Scylla component (e.g., sstables have a different compression), or are ready to be used by another component, so this patch moves all of them into message/, where RPC is implemented. Theoretically, we may want in the future to use this cluster of classes for some other component, but even then, we shouldn't just have these files individually in utils/ - these are not useful stand-alone utilities. One cannot use "shared_dict.hh" assuming it is some sort of general-purpose shared hash table or something - it is completely specific to compression and zstd, and specifically to its use in those other classes. Beyond moving these 8 files, this patch also contains changes to: 1. Fix includes to the 5 moved header files (.hh). 2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt for the three moved source files (.cc). 3. In the moved files, change from the "utils::" namespace, to the "netw::" namespace used by RPC. Also needed to change a bunch of callers for the new namespace. Also, had to add "utils::" explicitly in several places which previously assumed the current namespace is "utils::". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25149	2025-09-30 17:03:09 +03:00
Avi Kivity	5b6570be52	Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well. This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality. Fixes #25195. Closes scylladb/scylladb#26003 * github.com:scylladb/scylladb: test/cluster: Add tests for invalid SSTable compression options test/boost: Add tests for SSTable compression config options main: Validate SSTable compression options from config db/config: Add SSTable compression options for user tables db/config: Prepare compression_parameters for config system compressor: Validate presence of sstable_compression in parameters compressor: Add missing space in exception message	2025-09-28 20:23:23 +03:00
Nikos Dragazis	8d5bd212ca	main: Validate SSTable compression options from config `compression_parameters` provides two levels of validation: * syntactic checks - implemented in the constructor * semantic checks - implemented by `compression_parameters::validate()` The former are applied implicitly when parsing the options from the command line or from scylla.yaml. The latter are currently not applied, but they should be. For lack of a better place, apply them in main, right after joining the cluster, to make sure that the cluster features have been negotiated. The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation will fail if the feature is disabled and a dictionary compression algorithm has been selected. Also, mark `validate()` as const so that it can be called from a config object. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Pavel Emelyanov	f6860d1de0	Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit. No need to backport, view build coordinator is not a part of any release yet. Closes scylladb/scylladb#26122 * github.com:scylladb/scylladb: mv: fix typo in start_backgroud_fibers mv: run view building worker fibers in streaming group	2025-09-22 15:28:38 +03:00
Karol Nowacki	eae71d3e91	vector_store_client: Move to vector_search module Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module.	2025-09-22 08:01:47 +02:00
Michał Chojnowski	9e70df83ab	db: get rid of sstables-format-selector Our sstable format selection logic is weird, and hard to follow. If I'm not misunderstanding, the pieces are: 1. There's the `sstable_format` config entry, which currently doesn't do anything, but in the past it used to disable cluster features for versions newer than the specified one. 2. There are deprecated and unused config entries for individual versions (`enable_sstables_mc_format`, `enable_sstables_md_format`, etc). 3. There is a cluster feature for each version: ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc. (Currently all sstable version features have been grandfathered, and aren't checked by the code anymore). 4. There's an entry in `system.scylla_local` which contains the latest enabled sstable version. (Why? Isn't this directly derived from cluster features anyway)? 5. There's `sstable_manager::_format` which contains the sstable version to be used for new writes. This field is updated by `sstables_format_selector` based on cluster features and the `system.scylla_local` entry. I don't see why those pieces are needed. Version selection has the following constraints: 1. New sstables must be written with a format that supports existing data. For example, range tombstones with an infinite bound are only supported by sstables since version "mc". So if a range tombstone with an infinite bound exists somewhere in the dataset, the format chosen for new sstables has to be at least as new as "mc". 2. A new format might only be used after a corresponding cluster feature is enabled. (Otherwise new sstables might become unreadable if they are sent to another node, or if a node is downgraded). 3. The user should have a way to inhibit format upgrades if he wishes. So far, constraint (1) has been fulfilled by never using formats older than the newest format ever enabled on the node. (With an exception for resharding and reshaping system tables). 
Constraint (2) has been fulfilled by calling `sstable_manager::set_format` only after the corresponding cluster feature is enabled. Constraint (3) has been fulfilled by the ability to inhibit cluster features by setting `sstable_format` to some fixed value. The main thing I don't like about this whole setup is that it doesn't let me downgrade the preferred sstable format. After a format is enabled, there is no way to go back to writing the old format again. That is no good -- after I make some performance-sensitive changes in a new format, it might turn out to be a pessimization for the particular workload, and I want to be able to go back. This patch aims to give a way to downgrade formats without violating the constraints. What it does is: 1. The entry in `system.scylla_local` becomes obsolete. After the patch we no longer update or read it. As far as I understand, the purpose of this entry is to prevent unwanted format downgrades (which is something cluster features are designed for) and it's updated if and only if relevant cluster features are updated. So there's no reason to have it, we can just directly use cluster features. 2. `sstable_format_selector` gets deleted. Without the `system.scylla_local` around, it's just a glorified feature listener. 3. The format selection logic is moved into `sstable_manager`. It already sees the `db::config` and the `gms::feature_service`. For the foreseeable future, the knowledge of enabled cluster features and current config should be enough information to pick the right formats. 4. The `sstable_format` entry in `db::config` is no longer intended to inhibit cluster features. Instead, it is intended to select the format for new sstables, and it becomes live-updatable. 5. 
Instead of writing new sstables with "highest supported" format, (which used to be set by `sstables_format_selector`) we write them with the "preferred" format, which is determined by `sstable_manager` based on the combination of enabled features and the current value of `sstable_format`. Closes scylladb/scylladb#26092 [avi: Pavel found the reason for the scylla_local entry - it predates stable storage for cluster features]	2025-09-19 16:17:56 +03:00
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
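
The mechanical rename described in commit a1ea553fe1 above (`distributed<T>` to `sharded<T>`, plus the header rename) can be illustrated with a minimal Python sketch. This is not the tooling used in the commit, which relied on `git grep` and `sed`. The function names `replace_distributed` and `replace_header` are hypothetical, and the pattern assumes the sed expression captured a run of identifier characters (`[A-Za-z0-9:_]*`) as its single group.

```python
import re

# Assumed equivalent of the sed pattern: "distributed<" at a word start,
# followed by a template argument made only of identifier characters
# and "::", then ">". The argument is captured so it can be reused.
DISTRIBUTED_RE = re.compile(r"\bdistributed<([A-Za-z0-9:_]*)>")


def replace_distributed(source: str) -> str:
    """Rewrite distributed<T> to sharded<T> (hypothetical helper)."""
    return DISTRIBUTED_RE.sub(r"sharded<\1>", source)


def replace_header(source: str) -> str:
    """Rewrite the include file name, as the second sed loop does."""
    return source.replace("distributed.hh", "sharded.hh")


if __name__ == "__main__":
    line = "seastar::distributed<db::config> cfg;"
    print(replace_distributed(line))
    print(replace_header("#include <seastar/core/distributed.hh>"))
```

Because the character class contains neither `<` nor `>`, a nested argument such as `distributed<std::vector<int>>` is left untouched by this pattern, so a purely regex-driven pass of this kind may leave some occurrences for manual editing.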


1557 Commits